HEWLETT-PACKARD 







© Copr. 1949-1998 Hewlett-Packard Co. 








March 1987 Volume 38 • Number 3 



Articles 



4 Hardware Design of the First HP Precision Architecture Computers, by David A.
Fotland, John F. Shelton, William R. Bryg, Ross V. La Fetra, Simin I. Boschma, Allan S.
Yeh, and Edward M. Jacobs. The CPU is TTL, the memory is 256K DRAMs, and the processor
pipeline executes an instruction every 125 ns.



18 An Automated Test System for the First HP Precision Architecture Computers, by
Thomas B. Wylegala, Long C. Chow, and Randy J. Teegarden. The test system requires
minimal cooperation from the unit under test.



21 A Distributed Terminal Controller for HP Precision Architecture Computers Running
the MPE XL Operating System, by Gregory F. Buchanan, Francois Gaullier, Olivier
Krumeich, Eric Lecesne, Jean-Pierre Picq, and Heng V. Te. The DTC not only saves space in
the SPU cabinet, but also offloads the character-oriented tasks from the host computer.



29 Hewlett-Packard Precision Architecture Compiler Performance, by Karl W. Pettis and
William B. Buzbee. Here's how the combination of a RISC architecture and optimizing
compilers can outperform CISC machines.



38 Viewpoints: A Viewpoint on Calculus, by Zvonko Fazarinc. Should the infinitesimal
calculus be taught at all?



Departments 



3 In this Issue 
3 What's Ahead 
35 Authors 



Editor, Richard P. Dolan • Associate Editor, Business Manager, Kenneth A. Shaw • Assistant Editor, Nancy R. Teater • Art Director, Photographer, Arvid A. Danielson
Support Supervisor, Susan E. Wright • Administrative Services, Typography, Anne S. LoPresti • European Production Supervisor, Michael Zandwijken






© Hewlett-Packard Company 1987 Printed in U.S.A.



In this Issue 

This issue continues our series on HP Precision Architecture topics. On
page 4 you'll find the hardware design story of the first two members of HP's
new generation of computers based on HP Precision Architecture. The HP
9000 Model 840 Computer runs the HP-UX operating system and is designed
for technical and real-time applications. HP-UX, HP's version of AT&T's
UNIX® System V operating system, was featured in our December 1986
issue. The HP 3000 Series 930 runs the MPE XL operating system and is
designed for business data processing. We've received a paper on MPE XL
and will be publishing it later this year. Both the Model 840 and the Series
930 have the same processor, which is noteworthy because it uses a relatively old-fashioned
integrated circuit technology, TTL, and yet achieves about four times the speed of the fastest of
HP's previous-generation computers. It was, of course, this potential speed of RISC-like
architectures (RISC means reduced instruction set computer) that triggered the development of HP
Precision Architecture. Future computers implementing the architecture in state-of-the-art VLSI
technology are expected to be even faster. The Model 840/Series 930 processor consists of the six
smallest boards on this month's cover. The two larger boards are an 8M-byte memory module, and the
board with the handle is the system monitor.

HP Precision Architecture is more than just a RISC architecture. For any architecture, compilers 
must be developed to provide a high-level language interface to the machine. Whether the speed 
potential of a RISC architecture is realized, particularly for commercial languages requiring many 
complex operations, is largely a function of compiler design. How the HP Precision Architecture 
compiler designers approached some of the more challenging problems is detailed in the paper 
on page 29, which also gives performance data showing how well they succeeded in solving
these problems. 

The HP 3000 Series 930 can have a large number of terminals connected to it through intelligent 
hardware modules called HP 2345A Distributed Terminal Controllers, each of which accommo- 
dates 48 terminals. The theory, design, and operation of the DTC are described in the article on 
page 21. Queueing theory was used to predict its performance. On page 18 is a description of 
the production test system for the Model 840 and Series 930 computers. 

I can't remember ever having anything controversial in these pages, so the Viewpoints article 
on page 38 may be the first time. It's a paper presented to the Mathematics Panel of the American 
Association for the Advancement of Science by Zvonko Fazarinc of HP Laboratories. We hope 
you'll find his ideas on the teaching of infinitesimal calculus thought-provoking. 

-R. P. Dolan




What's Ahead 

Next month's issue will feature the design of the HP 8175A Data Generator and its arbitrary 
waveform generator option. There will be two research reports, one on software reliability and 
one on surface mount solder joint failure modes. Another paper will discuss the design and 
application of HP's PL-10 software package, a master planning tool for the
semiconductor manufacturing industry.

Hardware Design of the First HP Precision
Architecture Computers 

The HP 3000 Series 930 and the HP 9000 Model 840 are 
implemented with commercial TTL logic. 

by David A. Fotland, John F. Shelton, William R. Bryg, Ross V. La Fetra, Simin I. Boschma, Allan S. 
Yeh, and Edward M. Jacobs 



THE HP 9000 MODEL 840 and the HP 3000 Series
930 are the first technical and commercial computer
products, respectively, to use the new Hewlett-Packard
Precision Architecture.1 HP Precision Architecture
combines a simplified, RISC-like instruction set with a
powerful coprocessor architecture, a 64-bit virtual memory
addressing system, a new high-performance I/O architecture,2
and provision for multiprocessors.

The HP 9000 Model 840 and the HP 3000 Series 930 are
both based on the same processor, memory system, and
I/O system. The processor consists of five printed circuit
boards, each 8.4 by 11.3 inches, containing off-the-shelf
TTL logic. It uses FAST™ TTL, 25-ns and 35-ns static
RAMs, and 25-ns and 35-ns PALs™. These five boards
include the processor pipeline, which fetches and executes
an instruction every 125 ns, a 4096-entry translation
lookaside buffer (TLB) for high-speed address translation,
and 128K bytes of cache memory. An additional (sixth)
board contains the hardware floating-point coprocessor.
Each board contains about 150 ICs.

FAST is a trademark of Fairchild Camera and Instrument Corporation.
PAL is a registered trademark of Monolithic Memories.

A 20-Mbyte/s bus called the MidBus connects the CPU, 
main memory, high-speed I/O cards, and I/O channels. 
There are six memory slots, and memory comes in two- 
board 8M-byte sets. This gives a maximum of 24M bytes 
of memory. Memory uses 256K-bit nibble-mode dynamic 
RAMs with single-bit error correction and double-bit error 
detection. There are seven general-purpose I/O slots, which 
can be used for high-speed I/O cards or I/O channels. The 
I/O channel is a two-board set. Most I/O is handled by cards 
on an HP CIO bus connected to the MidBus through an I/O 
channel. The HP CIO bus is a 5-Mbyte/s I/O bus. 

A system monitor card monitors power supply levels 
and temperature and provides front-panel functions and 
system overtemperature shutdown. An access port card 
allows remote field support access for diagnosis. 

The differences between the HP 9000 Model 840 and the
HP 3000 Series 930 are in configurability and software.
The HP 9000 Model 840 (Fig. 1) is a technical machine.
It runs HP-UX, HP's version of AT&T's UNIX* System V
operating system with real-time extensions.3 The Model
840 is a single-bay, one-meter-high system. Its input/output
system has one HP CIO channel, and up to 12 CIO cards.
Terminal I/O is done using a six-port multiplexer card,
limiting the number of terminals to sixty. 8M bytes of
memory is standard.

Fig. 1. The HP 9000 Model 840
Computer is the first HP Precision
Architecture computer for technical
and real-time applications. Its
operating system is HP-UX, HP's
version of AT&T's UNIX System V
operating system.

The HP 3000 Series 930 (Fig. 2) is a business machine.
It runs MPE XL,4 a new version of HP's proprietary MPE
operating system, and provides compatibility for existing
HP 3000 customers. The Series 930 has two one-meter-high
bays to provide more I/O capacity. It has three CIO
channels, and 16M bytes of memory is standard. The second
bay contains two CIO buses and up to two HP 2345A
Distributed Terminal Controllers.5 Terminal I/O for MPE XL
is done using an IEEE 802.2 local area network and the HP
2345As, each of which can handle up to 48 terminals and
can be located near a work group using a LAN cable.

History of the Project 

Development of HP Precision Architecture began at HP 
Laboratories in early 1982. The processor instruction set 
and virtual memory system were well-defined by the end 
of 1982. The TTL implementation project began in April 
1983. 

The project's goals were low factory cost, good
performance, and very fast design time, since this was to be the
software development machine for the HP Precision
program. We used internal HP design tools for schematic
capture and timing analysis, and FTL, a simulator developed
at Amdahl Corporation, for gate level simulation of the
entire system. We did not build wirewrap breadboards, but
went straight to printed circuit boards.

*UNIX is a registered trademark of AT&T.

Simulation of the processor started in the fall of 1983,
and we had working processors by early 1984. A complete
processor with cache, TLB, and main memory was
delivered to the software developers in July 1984. This version
of the machine did not have an I/O channel or a hardware 
floating-point coprocessor since the architectures for these 
units were not complete. I/O was done with a parallel in- 
terface to an HP 9000 Series 200 Computer. 

This machine was sufficient for software development, 
and we built 36 of them over the next five months. This 
version used bench power supplies and had a very small 
cabinet. It ran at 30 MHz, rather than the 32 MHz of the 
final machine. Over the next six months the final cabinet, 
power system, and system monitor were designed and the 
I/O channel was completed. 

In January 1985 we put together our first lab prototype 
system. This system looked very much like the final prod- 
uct. It had working I/O channels, up to 24M bytes of mem- 
ory, and a full-speed processor. Between January and Sep- 
tember we built almost 200 for use as software development 
machines. This machine had only 32K bytes of cache memory
and a 1024-entry TLB. It could execute about 3.5 million
instructions per second (MIPS). 

Enhancements were added for higher performance and
better manufacturability during 1985. Newer, denser static
RAMs were available, so we quadrupled the size of the
caches and TLBs. Some other minor changes were made
to eliminate bottlenecks in the processor, and the final
performance rating is 4.5 MIPS. We also completed the design
of the floating-point coprocessor, and started building
full-functionality production prototypes in May 1986. The first
production Model 840 was shipped in November 1986.

CPU Design 

The simplicity of HP Precision Architecture allowed the
entire CPU and floating-point coprocessor to be
implemented on six medium-size boards, even though it was
designed using mostly MSI TTL, a technology with a fairly
low level of integration. These six boards and the six major
buses internal to the CPU are organized as shown in Fig. 3.

Instruction Unit 

The I-unit (instruction unit) controls the flow of
instructions. It executes branch instructions and handles traps
and interrupts. The I-unit also creates and distributes the 
system clocks that keep all of the elements of the processor 
synchronized. Instruction execution begins when the I-unit 
creates the address of the instruction to be executed and 
sends this address to the I-cache, which contains the in- 
structions to be executed. The I-cache sends the instruction 
back on the NI (next instruction) bus, which is distributed 
to all of the processor boards. 

Instruction decoding is decentralized, with each board 
decoding only as much of the instruction as is necessary 
for that board to do its job. 

Register File Board 

The register file board supplies the operands (the values 
to be operated on) for the instruction. It maintains thirty- 
two general registers. Each register is thirty-two bits wide. 
In addition, the register file maintains copies of the twenty- 
five control registers specified by HP Precision Architec- 
ture. 

In many computer architectures, operands can be stored
in memory. In HP Precision Architecture, all operands are
stored in the general registers or encoded in the
instructions. The only instructions that access data memory are
load instructions and store instructions. The addresses for
the load and store instructions are created from values in
the general registers and from values encoded in the
instructions.

The register file board drives the register values out to 
the rest of the CPU on the X (index) and B (base) buses. 
Sometimes in a pipelined implementation like the Model 
840/Series 930 processor, a result being created by one 
instruction or data being loaded from memory is needed 
immediately by the following instruction before there is 
time to store the result or the data into a general register. 
The register file board recognizes these cases and routes 
the data around the general registers to the instruction that 
needs it. 
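
The bypass decision described above can be sketched in a few lines of Python. This is only an illustrative model, not the logic on the register file board: the function name `read_operand` and the `in_flight` mapping (results that have been produced but not yet written into a general register) are assumptions introduced for the example.

```python
def read_operand(reg, regfile, in_flight):
    """Return the operand for register `reg`.

    regfile   -- list of 32 general register values
    in_flight -- dict {register number: value} holding results that a
                 preceding instruction has produced but not yet stored
                 (hypothetical structure, for illustration only)
    """
    # If the preceding instruction is still producing this register,
    # route its result around the register file (the bypass case).
    if reg in in_flight:
        return in_flight[reg]
    # Otherwise the stored register value is current.
    return regfile[reg]


regfile = [0] * 32
regfile[5] = 10
pending = {5: 99}          # an earlier instruction is about to write r5
print(read_operand(5, regfile, pending))   # bypassed value
print(read_operand(4, regfile, pending))   # value from the register file
```

In this model the following instruction sees the new value of register 5 even though the register file still holds the old one.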

Execution Unit 

The E-unit (execution unit) performs arithmetic
calculations on the operands. It executes the arithmetic
instructions and creates the addresses for load and store
instructions. It contains a 32-bit ALU (arithmetic logic unit) for
arithmetic and logical calculations, a barrel shifter for shift
instructions, and complex mask/merge circuitry for
extracting and depositing bit strings. It also contains a preshifter
on one input to the ALU. This is used in address
calculations and for special instructions used in software multiply
routines (the Model 840/Series 930 does not execute
multiply instructions directly in hardware).

The E-unit sends its result back to the register file over
the R (result) bus. If the instruction is a load or store
instruction, then the address is sent to the cache controller and
TLB boards on the CADR (cache address) bus. The E-unit
also creates a condition code based on its result. That
condition code is sent to the I-unit to be used for conditional
branches and conditional skips.

Fig. 3. Block diagram of the CPU
for the HP 9000 Model 840 and
HP 3000 Series 930 Computers.

TLB, Cache, and Coprocessor

The TLB (translation lookaside buffer) controls the access 
to virtual memory. HP Precision Architecture provides a 
very large global address space. Addresses in the Model 
840/Series 930 processor are 48 bits long. The architecture 
will support up to 64-bit addresses. This huge address 
space increases performance by making memory manage- 
ment easier in both software and hardware. However, it 
would be impossible to support that much physical mem- 
ory, so it's necessary to translate the large virtual addresses 
into smaller physical addresses. 

The TLB performs this translation from the virtual mem- 
ory address (the address that the processor sees) to the 
physical memory address (the address that the memory 
system sees). It also keeps track of protection information 
(i.e., which user is allowed to access which portions of 
virtual memory). 

The large global virtual address space specified by the 
architecture allows effective use of a TLB with many en- 
tries. The Model 840/Series 930 TLB has entries for 4096 
pages of virtual memory. Each page contains 2K bytes of 
code or data. This is an enormous number of entries
compared to other computers, and makes the performance
penalty for the virtual-to-physical translation very small (the
penalty comes from the processing necessary when the
TLB does not contain an entry for the virtual memory page
that the processor is trying to access).

The cache controller manages two high-speed cache 
memories, one for code and one for data. In other
architectures, caches, if they are used, are forced to be transparent,
that is, invisible to the software. HP Precision Architecture 
allows explicit software management of the caches, thereby 
making it possible to separate the code and data caches, 
doubling the available bandwidth between the caches and 
the rest of the processor. 

The floating-point coprocessor works in parallel with 
the main processor to do floating-point calculations. This 
allows the processor to continue processing during the sev- 
eral cycles that it may take for the floating-point coproces- 
sor to complete the execution of a floating-point instruc- 
tion. 

Architectural Impact 

Because the architecture is simple and regular, it allows 
for a nonmicrocoded implementation such as this one. This 
means that the implementation is a fairly straightforward 
interpretation of the architecture, and architectural features 
have a large impact on the processor organization. 

For example, all instructions are exactly 32 bits in length. 
Most instructions use one or two general registers as 
operands, and the register addresses of these operands are 
always encoded in the same place in the 32-bit instruction. 
This means that prefetching of the instruction and its 
operands can be done without regard to the decoding of the 
instruction or to the execution of the previous instruction. 

As the instruction comes out of the cache and is
distributed to the processor boards, the register file board receives
the instruction from the NI (next instruction) bus. The
register file board immediately prefetches from the register
file the two operands necessary for the instruction, so that
at the beginning of the next cycle, when the rest of the
processor is prepared to execute the instruction, the
operands have already been obtained and are ready to be
driven out to the rest of the processor on the X and B buses.

Because of the simplicity of the instruction encoding, it 
was very efficient to distribute the instruction decoding 
among all of the processor boards, with each board
decoding only the piece of the instruction that applies to that
board. Each board, then, receives the instruction from the 
NI bus as it comes out of the cache and prepares to execute 
it during the following cycle. 

Because the architecture relies so heavily on register val- 
ues for operands, the register file is central to the instruction 
flow. Each instruction begins with the prefetching of the 
instruction from the cache and the prefetching of the 
operands from the register file. From here, the operands 
fan out to the rest of the processor, which consists of several 
short, parallel data paths. 

This, again, is a result of the architecture, which heavily 
emphasizes these short, parallel data paths as a means of 
increasing performance. The E-unit, for example, has a
barrel shifter combined with sophisticated mask/merge
circuitry for bit manipulation. It also contains a 32-bit ALU.
The results from these two pieces of circuitry are never 
needed in the same instruction, allowing them to be placed 
in parallel so that neither one will impact the speed of the 
instructions that use the other. 

The I-unit can calculate branch target addresses in paral- 
lel with E-unit calculations, which allows for instructions 
that calculate an arithmetic result and conditionally branch 
based on that result, all in one cycle. This makes possible 
very efficient loops and range checking in the code. 

Another example of the parallelism encouraged by the 
architecture can be found in the cache and TLB. Because 
the architecture does not permit one physical address to 
be referenced by more than one virtual address, the cache 
and TLB accesses can be done in parallel. At the same time 
that the TLB is performing the translation from virtual to 
physical address, the cache is obtaining the data (or code), 
and reading a tag that indicates which physical address 
that data belongs to. When the TLB has completed the 
translation, the physical address is sent to the cache over
the PPN (physical page number) bus, and compared to the
physical address that the cache has read from its tags to 
determine whether this data is really the data that is 
needed. In most architectures, these two processes must 
be done serially, resulting in a much longer cache access 
time. 

Processor Pipeline 

The processor is pipelined. This means that several in- 
structions are in various stages of execution at any one 
time. Whereas this implementation technique must be 
made transparent in most architectures, it is supported, 
and in fact encouraged, by HP Precision Architecture. 

The pipeline has three stages, as shown in Fig. 4. Each
stage takes 125 ns, and is subdivided into two minor stages.
The first stage is referred to as the fetch stage. During the
first half of this cycle the address of the instruction is sent
to the instruction cache from the I-unit. During the second
half, the instruction is returned and distributed, instruction
decoding is begun, and the register operands are read out
of the register file.

The second stage is referred to as the execute stage. It is 
during the first half of this stage that the arithmetic result 
is calculated if this is an arithmetic instruction, or the data 
address is calculated if this instruction is a load or store 
instruction. If this is a branch instruction, the branch target 
address is also calculated during the first half of this stage. 
For arithmetic instructions and for conditional branch
instructions, the condition is calculated during the second
half of this cycle. For loads and stores, the second half of 
the cycle is used to drive the data address to the cache 
controller and the TLB. 

Notice that if the instruction is a branch instruction, 
because of the pipeline the following instruction is being 
fetched while the branch target is being calculated. In most 
architectures, this would result in an instruction being 
fetched that was not going to be executed, and would result 
in a wasted cycle with every branch. In HP Precision Ar- 
chitecture, branches are delayed by one instruction. In 
other words, one additional instruction is executed after 
the branch instruction before the branch target is reached. 
This is an example of how the architecture supports 
pipelined implementations, resulting in a performance im- 
provement over traditional architectures. 
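
The effect of a one-instruction branch delay on execution order can be illustrated with a small model. The sketch below is an assumption-laden toy, not a description of the I-unit: the program encoding (`("op", None)` and `("branch", target)` tuples), the function name `run_with_delay_slot`, and the restriction to unconditional branches are all choices made for the example.

```python
def run_with_delay_slot(program, max_steps=50):
    """Return the order in which instruction indices are executed.

    program -- list of ("op", None) or ("branch", target_index) tuples
               (a made-up encoding used only for this illustration)
    """
    trace = []
    pc = 0
    pending_target = None      # target of a branch whose delay slot is next
    while pc < len(program) and len(trace) < max_steps:
        kind, arg = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction was the delay slot; the branch now takes effect.
            next_pc = pending_target
            pending_target = None
        if kind == "branch":
            # The branch does not redirect fetch until one more
            # instruction (the delay slot) has executed.
            pending_target = arg
        pc = next_pc
    return trace


program = [("op", None), ("branch", 4), ("op", None), ("op", None), ("op", None)]
print(run_with_delay_slot(program))   # instruction 2 runs before the target, 3 is skipped
```

Because the instruction after the branch is always executed, the fetch that overlaps the branch target calculation is never wasted in this model.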

The third pipeline stage is referred to as the load/store 
stage. During this stage, load data is returned from the 
cache, or stored data is written to the cache. It is during 
the first half of this stage that the E-unit result is actually 
written into the register file, and during the first half of 
the following cycle that the load data is written into the 
register file. Notice that the register file is only written 
during the first half of any cycle. This is because during 
the second half cycle it is necessary to read from the register 
file the operands being used by the instruction that is cur- 
rently in its fetch stage. 
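
As a rough illustration of the overlap among the three stages, the following sketch lists which instruction occupies each stage in each 125-ns cycle. It assumes the ideal case with no interlocks, traps, or cache misses; the function name `pipeline_schedule` and the dictionary layout are inventions for the example.

```python
STAGES = ("fetch", "execute", "load/store")
CYCLE_NS = 125   # each stage takes one 125-ns cycle


def pipeline_schedule(num_instructions):
    """Return a list of per-cycle dicts mapping stage name -> instruction number."""
    schedule = []
    total_cycles = num_instructions + len(STAGES) - 1
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            instr = cycle - depth          # instruction that entered `depth` cycles ago
            if 0 <= instr < num_instructions:
                row[stage] = instr
        schedule.append(row)
    return schedule


for cycle, row in enumerate(pipeline_schedule(4)):
    print(f"t = {cycle * CYCLE_NS:4d} ns  {row}")
```

Once the pipeline is full, one instruction completes per cycle even though each instruction spends three cycles in the pipeline.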

Cache and TLB Design 

The cache and TLB (translation lookaside buffer) speed
up memory accesses by keeping recently accessed data and
virtual address translations in local high-speed memory.
They are designed to give the best performance without
increasing the CPU's basic cycle time. They take advantage
of the architecture, which allows a simple design, and they
are pipelined to get increased bandwidth without increased
hardware.

The cache is a high-speed memory that shortens typical 
main memory access times by keeping copies of the most 
recently accessed data. The cache is divided into a 64K-byte 
instruction cache and a 64K-byte data cache, each of which 
is divided into 4096 16-byte blocks. Each block has an 
address tag that specifies the block of memory from which 
it came. When the processor accesses data or instructions, 
the block is copied from main memory into the instruction 
or data cache, as appropriate. All further references use 
the copy in the cache, until the cache block is needed for 
a different block from memory. At that point, the block 
that is removed is written back to memory only if it has 
been modified by the processor. 
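
The replacement policy just described (copy a block in on a miss, write the old block back only if it was modified) can be modeled in a few lines. This is a simplified sketch under stated assumptions: each 16-byte block is treated as a single value, main memory is a Python dictionary, and the class name `WriteBackCache` and its methods are hypothetical.

```python
BLOCK_BYTES = 16
NUM_BLOCKS = 4096          # 4096 blocks x 16 bytes = 64K bytes


class WriteBackCache:
    """Toy direct-mapped cache that writes a block back only if it is dirty."""

    def __init__(self, memory):
        self.memory = memory       # dict: block number -> block contents
        self.lines = {}            # index -> [block number, contents, dirty flag]
        self.writebacks = 0

    def _line(self, addr):
        block = addr // BLOCK_BYTES
        index = block % NUM_BLOCKS             # direct mapped: one possible slot
        line = self.lines.get(index)
        if line is None or line[0] != block:   # miss
            if line is not None and line[2]:   # old block was modified
                self.memory[line[0]] = line[1]
                self.writebacks += 1
            line = [block, self.memory.get(block, 0), False]
            self.lines[index] = line
        return line

    def read(self, addr):
        return self._line(addr)[1]

    def write(self, addr, value):
        line = self._line(addr)
        line[1] = value
        line[2] = True                         # mark the block dirty


mem = {}
cache = WriteBackCache(mem)
cache.write(0, 11)                 # block 0 is now dirty in the cache
cache.read(BLOCK_BYTES * NUM_BLOCKS)   # maps to the same slot, evicts block 0
print(mem, cache.writebacks)
```

Evicting a block that was only read costs no memory write in this model, which is the point of the write-back policy.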

Similarly, the TLB speeds up virtual address translations 
by acting as a cache for recent translations. Both virtual 
memory and physical memory are divided into pages of 
2K bytes, and each TLB entry maps a virtual page number 
to a physical page number. Each virtual address is made 
up of a virtual part (i.e., the virtual page number) and a 
physical part (i.e., the offset within the page). The TLB
translates the virtual page number to get a physical page 
number. This is concatenated with the page offset to gen- 
erate the physical address. 
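
The split-and-concatenate step can be written out as follows. The sketch assumes only what the text states, namely 2K-byte pages (an 11-bit offset); the `page_map` dictionary standing in for the TLB and the function name `translate` are illustrative assumptions.

```python
PAGE_BITS = 11                       # 2K-byte pages -> 11-bit page offset
OFFSET_MASK = (1 << PAGE_BITS) - 1


def translate(virtual_addr, page_map):
    """Translate a virtual address using a {virtual page: physical page} map."""
    vpn = virtual_addr >> PAGE_BITS          # the part that is translated
    offset = virtual_addr & OFFSET_MASK      # the part that passes through unchanged
    ppn = page_map[vpn]                      # a KeyError here models a missing translation
    return (ppn << PAGE_BITS) | offset       # concatenate page number and offset


page_map = {3: 7}                            # virtual page 3 lives in physical page 7
vaddr = (3 << PAGE_BITS) | 5
print(hex(translate(vaddr, page_map)))
```

Only the page number changes; the low 11 bits of the virtual and physical addresses are identical.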

To allow the implementation of large, high-speed caches, 
HP Precision Architecture disallows address aliasing, that 
is, the capability of having two different virtual addresses
pointing to the same physical location. This allows the 
Model 840/Series 930 processor to access the cache and 
TLB in parallel, without the problems and constraints this 
has had in other architectures. Thus, the access time is the 
worst of the TLB and cache access times, rather than the 
sum of them (see Fig. 5). 
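
A minimal sketch of the parallel lookup follows. It is a model under assumptions, not the actual cache controller: the article does not say exactly which address bits index the cache, so the example assumes the low-order bits of the virtual address above the 16-byte block offset, and it stores the physical page number as the tag. All names (`cache_index`, `lookup`, the dictionaries) are hypothetical.

```python
BLOCK_BITS = 4       # 16-byte blocks
INDEX_BITS = 12      # 4096 blocks per cache
PAGE_BITS = 11       # 2K-byte pages


def cache_index(vaddr):
    """Assumed index: virtual address bits just above the block offset."""
    return (vaddr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)


def lookup(vaddr, tlb, cache_tags, cache_data):
    """Model one access; the TLB read and cache read are independent of each other."""
    index = cache_index(vaddr)
    ppn = tlb.get(vaddr >> PAGE_BITS)    # TLB translation (done in parallel in hardware)
    tag = cache_tags.get(index)          # cache tag read (done in parallel in hardware)
    if ppn is None:
        return "tlb_miss", None
    if tag != ppn:                       # compare the physical page number with the tag
        return "cache_miss", None
    return "hit", cache_data[index]


def access_time_parallel(t_tlb, t_cache):
    return max(t_tlb, t_cache)           # the slower of the two accesses


def access_time_serial(t_tlb, t_cache):
    return t_tlb + t_cache               # translate first, then read the cache


vaddr = (5 << PAGE_BITS) | 0x123
idx = cache_index(vaddr)
print(lookup(vaddr, {5: 9}, {idx: 9}, {idx: "block"}))
print(access_time_parallel(30, 40), access_time_serial(30, 40))   # illustrative times only
```

The times passed to the last two functions are arbitrary illustrative values, not measurements from the Model 840/Series 930.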

TLB Operation 

The virtual memory space is divided into virtual pages
of 2048 bytes each. A data structure in main memory called
the page table keeps information about each virtual page
that is currently in use (i.e., a copy of that page exists in
the physical memory). This table has one entry for each
virtual page, and contains information on the protection
of that page (which users are allowed what kind of access
to the page) and the physical page number of that page
(where the page exists in physical memory).

The TLB acts as a cache for this information. Just as the 
instruction and data caches keep copies of recently ac- 
cessed memory locations, the TLB keeps copies of the
information for recently accessed pages. The Model 840/
Series 930 TLB has entries for 4096 pages, 2048 for code
and 2048 for data.

With each memory access (both instruction fetches and 
data loads and stores), the TLB checks the protection infor- 
mation for that page and the physical address of that page. 
It sends the physical address to the cache, so that the cache 
can check to see if the word being accessed is contained 
in the cache. If the protection information indicates that 
the user is not allowed the kind of access being attempted, 
the TLB signals the register file board to back out of the 
instruction and signals the I-unit to raise a trap. 

TLB miss handling is done in software on the Model 
840/Series 930. This means that if none of the entries in 
the TLB corresponds to the virtual address being accessed, 
a trap is raised. Software must then get the information 
about that page from the page table, place it in a TLB entry, 
and reexecute the instruction. 

The TLB is pipelined so that it can perform both an 
instruction address translation and a data address transla- 
tion each cycle. The first half of every cycle, it reads the 
tag and translation out of the TLB for the instruction being 
fetched (see Fig. 6). During the second half of the cycle, it 
checks the tag for a match (also known as a hit) and per- 
forms a protection check. 

Also during the second half of each cycle, the tag and 
translation are read out of the TLB for any memory access 
instruction currently in the execute phase. This is checked 
for a match and protection during the following half cycle, 
that is, the first half cycle of that instruction's load/store 
phase. Thus, each half cycle the TLB starts a new transla- 
tion, which will be completed one cycle later. 

The TLB is a direct mapped TLB. It has 2048 entries for
instructions and 2048 entries for data. Direct mapped
means that each virtual page translation can exist in only
one entry of the TLB, and if a program accesses another
page that maps to the same entry, the first one is replaced.
Although direct mapped TLBs (and caches) have greater
miss rates than set-associative TLBs of the same size, the
direct mapped TLB minimizes cycle time, which has a
greater impact on performance.

The TLB is addressed by the 9 LSBs of the virtual page 
number hashed with the 11 LSBs of the space ID to create
an 11-bit index. The hash takes the two LSBs of the space 
ID, then flips the next 9 bits of the space ID around and 
exclusive-ORs them with the 9 LSBs of the virtual page 
number. In addition, one address bit selects instruction or 
data, since the TLB is split for instruction and data accesses, 
although these use the same hardware. The address hash 
reduces the likelihood of a program's heavily using a single 
TLB entry for two different pages. In addition, the large 
size of the TLB greatly reduces overall miss rates. 
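
One possible reading of this hash is sketched below. Several details are assumptions rather than facts from the text: "flips around" is interpreted as reversing the order of the 9 bits, the two space ID LSBs are placed above the 9 hashed bits, and the instruction/data select is placed as the top bit of a 12-bit entry number. The names `reverse_bits` and `tlb_index` are hypothetical.

```python
def reverse_bits(value, width):
    """Reverse the order of the low `width` bits of `value`."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def tlb_index(space_id, vpn, is_data):
    """Return an entry number in the range 0..4095 (assumed bit placement)."""
    low2 = space_id & 0x3                        # two LSBs of the space ID
    next9 = (space_id >> 2) & 0x1FF              # next 9 bits of the space ID
    hashed9 = reverse_bits(next9, 9) ^ (vpn & 0x1FF)   # XOR with 9 LSBs of the page number
    index11 = (low2 << 9) | hashed9              # 11-bit index (placement is an assumption)
    return (int(is_data) << 11) | index11        # one more bit selects instruction or data


print(tlb_index(space_id=0b100, vpn=0, is_data=False))
print(tlb_index(space_id=3, vpn=0x1FF, is_data=True))
```

Under this reading, two spaces that use the same low-numbered pages land in different entries because their space ID bits perturb the index.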

In addition to the physical page number of the page trans- 
lated and a virtual tag to identify the corresponding virtual 
page, each TLB entry has extra information to implement 
the HP Precision protection scheme. This includes an
access rights field, an access ID, and several status bits. The
access rights field specifies how the page can be used (e.g.,
read only, read/execute, etc.) and the necessary privilege 
level to use it. The access ID is a key that must match one 
of the four protection IDs (in control registers) that pro- 
cesses can have, in addition to the access rights check. The 
status bits include an entry valid bit, a dirty bit, two differ- 
ent break-on-access bits, and an I/O bit which marks 
whether the page corresponds to an I/O module. 

Since TLBs do not contain the translations for all pages 
in memory simultaneously, they occasionally do not have 
the desired translation and a miss occurs. The instruction 
cannot complete without the translation, so the TLB causes 
a trap. This causes the current processor state to be saved 
in control registers, and execution continues in the software 
TLB miss handler. If the page is actually in memory, the 
handler will insert the needed TLB entry and retry the 
offending instruction. If the desired page is not in memory
(i.e., on disc), the handler will invoke the page fault
handler. Because the instruction is restarted after a TLB miss,
a TLB must be able to contain two arbitrary translations to 
be able to complete any instruction, one to fetch the instruc- 
tion and one to access data requested by the instruction (if 
load or store). If it cannot, there are cases where the program 
will get stuck first replacing the entry for fetch, then for 
data, and back again. Hence the TLB is split, half for instruc- 
tions and half for data, to guarantee forward progress. 
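
The forward-progress argument can be demonstrated with a toy model of a direct-mapped TLB and a restart-on-miss handler. Everything here is illustrative: the four-entry TLB size, the `slot_of` functions, and the name `traps_until_complete` are assumptions, and the model ignores protection checks and page faults.

```python
def traps_until_complete(slot_of, ifetch_page, data_page, max_traps=10):
    """Count TLB miss traps before one load/store instruction completes.

    slot_of -- function (kind, page) -> TLB slot, where kind is "I" or "D"
    Returns the trap count, or None if the instruction never completes.
    """
    tlb = {}
    traps = 0
    while traps <= max_traps:
        for kind, page in (("I", ifetch_page), ("D", data_page)):
            slot = slot_of(kind, page)
            if tlb.get(slot) != page:
                tlb[slot] = page       # software miss handler inserts the entry
                traps += 1
                break                  # the instruction is restarted from its fetch
        else:
            return traps               # both translations were present
    return None


ENTRIES = 4


def unified(kind, page):
    return page % ENTRIES              # instructions and data share every slot


def split(kind, page):
    return (kind, page % ENTRIES)      # separate halves for instructions and data


# Pages 1 and 5 collide in a 4-entry direct-mapped TLB.
print(traps_until_complete(unified, ifetch_page=1, data_page=5))
print(traps_until_complete(split, ifetch_page=1, data_page=5))
```

With the unified mapping the two translations keep evicting each other and the instruction never completes; with the split mapping it completes after two traps.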

One side effect of the very large HP Precision address 
space, 256 terabytes (2^48 bytes) in this case, is that all
processes execute in the same unified address space. This
allows different processes (or programs) to share code or
data more simply, since the addresses are the same. Also, 
since there is only one address space (as opposed to sepa- 
rate address spaces for each user), the operating system 
does not have to flush all entries out of the TLB whenever 
it switches processes. This cuts down on the number of 
TLB misses. 

Cache Operation 

The cache tags and control, like the TLB, are pipelined.
At the same time the TLB entry is read, the cache tags are 
read, and the cache tag comparison is performed when the 
TLB tag comparison and protection check are performed, 
both for instructions and for data. However, the data RAMs 
(for both the instruction and the data caches) are not 
pipelined. Instead, the data RAMs for the instruction cache 
and the data cache are implemented using separate RAMs 
so that the access can span an entire cycle. This allows the 
use of larger, slower RAMs for the data, without affecting 
the cycle time. Otherwise, the cycle time would have to 
be long enough to allow reading the slower data RAM in 
a half cycle. 

Like the TLB, the instruction cache and data cache are 
both direct mapped to minimize cycle time. They are as 
large as possible with current high-speed RAMs, thus keep- 
ing down miss rates. In addition, there are several features 
that allow the processor to keep running even though the 
cache is servicing a miss. 

In the case of a data cache miss, the cache allows the 
processor to continue running until either the processor 
needs the data (from a load) or the processor executes 
another cache access. The first case is called a load/use
interlock, and occurs when the cache receives a load instruction
for a particular register, and before the cache can
supply the data, it receives another instruction that uses 
that target register. This is detected by comparing the load 
target for any load instruction in progress with the register 
fields of instructions being fetched, and causing the proces- 
sor to freeze if there is a match. As soon as the data is 
returned to the processor, the interlock goes away and the 
processor can continue. 
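The comparison behind the load/use interlock amounts to a register-number match. The following is a minimal sketch, assuming a hypothetical representation in which the pending load target is a register number (or None when no load is outstanding) and the fetched instruction is described by the registers it reads.

```python
def load_use_interlock(pending_load_target, source_registers):
    """Return True if the processor must freeze.

    pending_load_target: register a load in progress will write, or None.
    source_registers: registers read by the instruction being fetched.
    """
    return (pending_load_target is not None
            and pending_load_target in source_registers)


# A load to r5 is still waiting on a data cache miss.
print(load_use_interlock(5, (3, 4)))     # False: r5 is not used, keep running
print(load_use_interlock(5, (5, 7)))     # True: instruction needs r5, freeze
print(load_use_interlock(None, (5, 7)))  # False: no load outstanding
```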

When there is a cache miss, the cache receives the data 
from main memory in a 4-word block, one word per cycle. 
To speed things up, as soon as the cache receives the re- 
quested data, it passes it through to the processor while it 
is also putting it into the cache. This allows execution to 
continue, even though the cache might still be servicing 
the miss. 

When there is an instruction cache miss, the cache 
freezes the processor immediately, since it needs the in- 
struction to continue. However, as soon as it receives the 
requested instruction from memory, it passes the instruc- 
tion and the following instructions through to the processor 
to allow it to continue. The processor continues receiving 
the instructions straight from memory until either the end 
of the block is reached or the processor executes a taken 
branch. At the end of the block, the processor goes back 
to getting its instructions normally, from the cache. If the 
processor branches, the cache will freeze the processor 
until it finishes servicing the miss, then allow execution 
to continue. These optimizations improve performance by 
reducing the average cache miss penalty. Cache perfor- 
mance is measured by measuring the total miss penalty, 
which is the product of the miss rate and the penalty for 
each miss. The miss rate is minimized by making the cache 
as large as possible. The miss penalty is minimized by 
allowing the processor to execute whenever possible, even 
during cache miss servicing. 

HP Precision Architecture allows a somewhat simpler 
cache than would otherwise have been possible, by putting 
the burden on software to keep the instruction and data 
caches consistent with each other and with any I/O being 
performed. The architecture provides cache flush and 
purge instructions, which software can use to guarantee 
that the copy in memory is up to date. Thus, the hardware 
does not have to check for a program modifying instructions,
or whether DMA is accessing data that is currently
in the cache. 

Performance 

Since this machine was the first HP Precision Architec- 
ture machine, we thought that it should be instrumented 
for hardware performance measurements. Thus the analy- 
sis interface card was born. This card is a coprocessor, and 
it has two functions. First, it depipes the instructions and 
presents the data to a front plane interface to a logic analyzer. 
It can show three 32-bit buses and some control signals to 
the analyzer. The three buses can be chosen to show the 
instruction address, instruction, either source register, the 
ALU result, the load/store address, or the load/store data. 
Control signals indicate whether the instruction was exe- 
cuted or nullified, or if a taken branch or a trap occurred. 

The analysis interface card and a disassembler for the 
HP 64000 Logic Development System were used extensively 
for both hardware and software debugging. The interface 
card was also used to take instruction traces for running 
performance simulations for other machine organizations.

The other function of the analyzer card is to collect per- 
formance statistics. It contains five 32-bit counters. Three 
of the counters can each count one of 32 predefined events. 
The other two counters form a pair, only one of which is 
readable by the processor. This pair of counters can count 
one of the 32 events in four ways: simple counting (like 
the other three), maximum duration of the event, number 
of times the event duration exceeds a threshold, or event 
occurrences masked by a one-zero-don't care comparator 
on one of the CPU buses. For example, the data cache miss 
rate can be measured by having one counter count data 
cache accesses and another count data cache misses. These 
statistics helped confirm the results of the cache and TLB 
simulations that were used to make trade-offs when the 
machine was being designed. 

Events that can be counted are: 

■ Cycles 

■ Fetched instructions 

■ Executed instructions 

■ Loads 

■ Stores 

■ Branches taken 

■ Branches not taken 

■ Branches nullifying next instruction 

■ Arithmetic operations nullifying next instruction 

■ All architectural nullified instructions 

■ All nullified instructions 

■ External interrupts 

■ Traps 

■ Instruction cache misses 

■ Data cache misses 

■ Dirty data cache misses 

■ Instruction cache accesses 

■ Data cache accesses 

■ Instruction TLB misses 

■ Data TLB misses 

■ Instruction TLB accesses 

■ Data TLB accesses 

■ I/O accesses 



■ Load/use interlocks

■ Cache wait cycles 

■ Coprocessor wait cycles 

■ Interlock and wait cycles 

■ Time spent at privilege level 0, 1, 2, or 3

■ Time spent with interrupts off 

■ Time spent in virtual code space 

■ One write port interlocks. 

MIPS Calculations 

A frequently used measure of raw CPU power is millions 
of instructions per second, or MIPS. The MIPS rating of a 
computer is calculated as one over the cycle time times 
the cycles per instruction: 



$$\text{MIPS} = \frac{1}{\text{cycle time}\,[\mu\text{s}] \times \text{CPI}}$$

MIPS ratings are a good measure of performance when 
comparing machines with the same computer architecture, 
but can be misleading when comparing different architec- 
tures, since one architecture may take fewer instructions 
to complete the job than another. The best way to compare 
machines is to run the same application on both machines. 
Even on machines with the same architecture, the MIPS 
rating is calculated using a standard instruction mix or 
measured when running a standard jobstream. If the appli- 
cation differs from the standard then the MIPS rating might 
not be a good predictor of performance when running the 
application. The instantaneous MIPS rate of a computer is 
quite variable, as can be seen from Fig. 7, so any MIPS 
number quoted is only a long-term average. MIPS rates 
vary with the job executed and with time. MIPS ratings 
also have the drawback of not taking into account operating 
system efficiency, compiler efficiency, or I/O system effi- 
ciency. Therefore, MIPS is not a good metric for predicting 
applications performance on systems running different 
operating systems. 

Cycles per instruction, or CPI, is the key measurement 
of how well a CPU uses its available power. Ideally, one 
instruction per cycle will be executed, so the CPI will be 
one. Calculating the CPI is straightforward: sum up the 
product of the penalties and their frequency, and add one: 

$$\text{CPI} = 1 + \sum_i P_i F_i$$

There are two ways of minimizing the CPI: reduce the
penalty ($P_i$) or reduce the frequency ($F_i$).

Model 840 MIPS 

For the HP 9000 Model 840 and HP 3000 Series 930 
Computers, the cycle time is 125 ns. If there were no penalties,
all instructions would take one cycle (CPI = 1), so
the maximum possible performance is 8 MIPS. 

The measured MIPS rate for the Model 840 varies from 
about 3.5 to 8 MIPS with an average of 4.5 to 5. (Fig. 7 
shows measured performance during typical operation.) 

The MIPS rating for the Model 840 can be calculated 
using the statistics gathered with the analysis interface card 
described above. First, there are nullified instructions. This 
happens when the compiler can't schedule a branch, and
when an arithmetic instruction skips on a condition. About
7% of all instructions are branches specifying nullify. 
About 2% of instructions are conditional skips that are 
taken. The load/use interlock happens when a load instruc- 
tion is followed by an instruction that uses the data re- 
turned. About 14% of the instructions have this interlock. 
As the optimizing compiler improves, these numbers 
should go down. The Model 840 has an interlock that most 
HP Precision Architecture machines won't have: the one 
write port interlock. Since the load instruction result comes 
back one cycle after the load is executed, the architecture 
expects the register files to be able to write two results at 
the same time, one from the load instruction, and one from 
the ALU. The Model 840/Series 930 register file is built 
from static RAMs and can only store one result at a time. 
If a load instruction is followed by an instruction that stores 
a result, there will be a one-cycle interlock. This happens 
on about 8% of instructions. 

Every instruction has the possibility of missing the I- 
cache. Nullified instructions are fetched, so they can miss 
the I-cache also. The I-cache miss rate is about 3% and the
miss penalty is 5 to 8 cycles with an average of about 7. 
The contribution to the CPI can be calculated by multiply- 
ing the miss rate (0.03) times the penalty (7) times the 
frequency of access (1.09). The frequency of access is 1.09 
because for every one access per executed instruction, there 
is 0.09 access (9%) because of nullified and skipped instruc- 
tions. This makes the CPI contribution 0.23 (see "Cache 
Performance," below). Every instruction can also miss the 
I-TLB. TLB misses cause a trap and are handled in software. 
This software is not doing any useful work so it is not 
counted as instructions executed for calculating MIPS. The 
miss handler is about 40 instructions and takes an average 
of about 70 cycles. The I-TLB miss rate is about 0.05%, so
the CPI contribution is about 0.04 (0.0005 times 70). Load
and store instructions can miss the D-cache and the D-TLB. 
Loads are about 25% of the instruction mix and stores are
about 15%. A D-cache miss can cost anywhere from 1 to
15 cycles with an average of about 8. The D-cache miss 
rate is about 3%. The D-TLB miss rate is about 0.1%. The 
D-cache contribution to the CPI is about 0.1, and the D-TLB
contribution is about 0.03. There is also a CPI contribution 
from flushing the cache for I/O. This is another function 
that most machines do in hardware, so it should not be 
counted in the instructions executed. The contribution to 
the CPI is about 0.04. The CPI is therefore: 

Basic instruction      1.00
Nullify                0.09
Load/use               0.14
One write port         0.08
I-cache miss           0.23
I-TLB miss             0.04
D-cache miss           0.10
D-TLB miss             0.03
I/O cache flush        0.04

Total CPI              1.75



This gives 4.57 MIPS as the calculated performance. 
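The arithmetic behind these figures can be reproduced with a short sketch. It uses the rounded contributions quoted in the text; the function and variable names are illustrative only.

```python
CYCLE_TIME_US = 0.125  # 125-ns cycle time, expressed in microseconds


def cpi(contributions):
    """CPI = 1 + sum of (penalty x frequency) contributions."""
    return 1.0 + sum(contributions.values())


def mips(cycle_time_us, cpi_value):
    """MIPS = 1 / (cycle time in microseconds x CPI)."""
    return 1.0 / (cycle_time_us * cpi_value)


# I-cache term: miss rate x average penalty x accesses per executed instruction
icache = 0.03 * 7 * 1.09
print(round(icache, 2))  # 0.23

contributions = {
    "nullify": 0.09,
    "load/use": 0.14,
    "one write port": 0.08,
    "I-cache miss": 0.23,
    "I-TLB miss": 0.04,
    "D-cache miss": 0.10,
    "D-TLB miss": 0.03,
    "I/O cache flush": 0.04,
}
total = cpi(contributions)
print(round(total, 2))                       # 1.75
print(round(mips(CYCLE_TIME_US, total), 2))  # 4.57
print(mips(CYCLE_TIME_US, 1.0))              # 8.0, the peak rate with no penalties
```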

Cache Performance 

Looking at the numbers in the preceding section, we can 
easily see that the cache is the major contributor to the CPI 
on this machine. 

The function of a cache is to make the memory look 
faster by remembering recent memory accesses in the hope 
that they will be used again. Most programs have some sort 
of locality, that is, memory references in the recent past
are likely to be used again in the near future. If this were 
not true, the cache would not work. 

Caches normally take some amount of time, usually one 
cycle, to fetch data. This is how high performance is 
achieved: memory normally takes more than one cycle. Of 
concern are the less frequent cases when the cache takes
more than one cycle to return the needed data. An example
of this is when a piece of data is used for the first time. It 
cannot be in the cache, and must be fetched from memory. 
It is these less frequent cases that determine cache (and 
processor) performance. 

The largest, and frequently only, penalty associated with 
caches is the miss penalty. It is the time the cache requires 
to get information (data or instructions) from memory. The 
cache needs to start a memory transaction, get the data 
back, and save it. All this time is counted as the miss 
penalty. 

Effects of Separate Caches 

The processor is capable of fetching an instruction every 
machine cycle, and some of these cycles, about 40%, also 
require a data cache reference. To prevent a large CPI increase
(about 0.4), it is necessary to use the instruction and
data caches during the same cycle. If the caches are com- 
bined (instruction and data cache are the same), the cache 
must be accessed twice per cycle. This becomes difficult 
to do with large caches, because the RAMs are just not fast 
enough. Splitting the cache into an instruction cache and 
a data cache allows both caches to be used in the same 
cycle without making them faster. 

To prevent thrashing (excessive cache missing), a com- 
bined cache should also be at least two-way (two-set) as- 
sociative. However, multiway associativity also slows 
down the cache, making it more difficult to build. Splitting 
the cache also eliminates this problem. 

The decision to have separate instruction and data caches 
allowed us to use a large cache; a 128K-byte combined 
cache would have been difficult and expensive to build. 
The split cache does have a slightly lower hit rate than the 
combined cache, but this presented less of a problem than 
alternative cache organizations. 

The cache is direct mapped, which means that each vir- 
tual address has exactly one place it can go in the cache. 
This is also known as a one-way associative cache. In two- 
way or four-way associative caches, a single virtual address 
can go in two or four places in the cache. Any number of
ways can be built, but the expense is great, so normally, 
two-way or four-way associativity is used. Since each vir- 
tual address can go in more than one spot, there is less 
likely to be a conflict between two addresses. Thus the 
miss rate of the cache is lower. Simulations of the Model
840/Series 930 cache show that the miss rate would be 
about 40% lower if a four-way associative cache had been 
used. But this would have cost nearly four times as much 
in hardware, and would have required an increase in the 
cycle time of the machine, since the cache access takes 
longer when the cache must determine which way (set) 
contains the data. The benefits did not seem commensurate 
with the cost. 
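The conflict behavior that separates a direct-mapped cache from an associative one can be seen in a small simulation. This is a sketch only: the tiny cache size and the least-recently-used replacement policy are assumptions made for the illustration, not details of the Model 840/Series 930 design.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, ways, line_bytes=16):
    """Count misses for a cache of `num_lines` lines arranged in `ways`-way sets.

    ways=1 models a direct-mapped cache. Replacement within a set is LRU
    (an assumption for this sketch).
    """
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets          # which set the line maps to
        tag = line // num_sets           # identifies the line within the set
        entries = sets[index]
        if tag in entries:
            entries.move_to_end(tag)     # mark as most recently used
        else:
            misses += 1
            if len(entries) >= ways:
                entries.popitem(last=False)  # evict least recently used
            entries[tag] = True
    return misses


# Two addresses that map to the same place in a 4-line direct-mapped cache.
trace = [0x00, 0x40] * 4
print(count_misses(trace, num_lines=4, ways=1))  # 8: every access conflicts
print(count_misses(trace, num_lines=4, ways=2))  # 2: both lines fit in one set
```

The associative cache removes the conflict misses, but each lookup must now determine which way holds the data, which is the cycle-time cost discussed above.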

Write-Through versus Write-To 

When the processor writes to the cache, the cache can 
do the write to memory at once (a write-through cache), or 
it can save the data and do the write to memory later (a 
write-to cache). The write-through cache has the benefit of 
always having a correct copy of the data in memory, something
nice for I/O. A write-to cache reduces the bus traffic
considerably, since only a small number of writes generate
a bus transaction. 

Keeping the memory up to date is a problem that HP 
Precision Architecture leaves to software. That removes 
most of the additional complexity normally associated with
the hardware on a write-to cache, and made the choice of 
a write-to cache clear. 

Caches are something that software has traditionally not 
been able to control. But HP Precision Architecture pro- 
vides explicit instructions for the software to maintain the 
caches. These are the purge and flush instructions, and all 
their variations. A purge removes the information from the 
cache without saving anything. A flush does the same, 
except that it writes the contents back into memory (for a 
write-to cache) before destroying it. The instructions come 
in two flavors, one for the instruction cache and one for 
the data cache. There is no purge instruction cache instruc- 
tion, since a program can never write to (change) the in- 
struction cache. 

Why does this make a difference for performance? The 
explicit purge and flush instructions take time to execute. 
Traditionally this time has not been required. But the 
hardware complexity is significantly less, and therefore we 
can build faster caches. Although it is difficult to measure, 
we believe that the cost of the explicit instructions is less 
than the performance gain from the simplicity of the design. 
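The difference between flush and purge, and the bus traffic saved by a write-to cache, can be sketched with a toy model. The class below is hypothetical: it tracks one word per line, has unlimited capacity, and writes back on flush only if the line is dirty, all of which are simplifying assumptions.

```python
class WriteToCache:
    """Toy write-to cache: writes stay in the cache until flushed."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value
        self.lines = {}        # address -> (value, dirty)
        self.bus_writes = 0

    def write(self, addr, value):
        self.lines[addr] = (value, True)   # no bus transaction yet

    def flush(self, addr):
        """Write the line back to memory if dirty, then remove it."""
        if addr in self.lines:
            value, dirty = self.lines.pop(addr)
            if dirty:
                self.memory[addr] = value
                self.bus_writes += 1

    def purge(self, addr):
        """Remove the line without saving anything."""
        self.lines.pop(addr, None)


memory = {}
cache = WriteToCache(memory)
for value in range(10):
    cache.write(0x100, value)    # a write-through cache would use the bus 10 times
cache.flush(0x100)
print(cache.bus_writes, memory[0x100])   # 1 9

cache.write(0x200, 5)
cache.purge(0x200)
print(0x200 in memory)                   # False: the purged data was discarded
```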

Memory transactions in the Model 840/Series 930 processor
occur on the MidBus, and are 16-byte (4-word) reads
and writes, named READ16 and WRITE16. After a cache miss, 
the bus states are: 

READ16            WRITE16
(cache miss)      (cache miss)

Address           Address
Dead Cycle        Dead Cycle
Tristate          Data 0
Data 0            Data 1
Data 1            Data 2
Data 2            Data 3
Data 3            Tristate
Tristate

The basic bus cycle is the same as the processor's, 125
ns. Two things are key: the latency and the data rate. The
latency is how long it takes to get the first word of data 
back (time from address to first data). The data rate is how 
fast the data comes back once it starts going. If the latency 
is high, the miss penalty will be high. It may be best to 
load more data in that case. If the data rate is the same as 
the instruction rate, bypassing the instruction cache can 
make a lot of sense. In general, the latency is largely deter- 
mined by the bus and the memory (dynamic RAMs) used. 
It is hard to reduce. The data rate is also determined by 
the bus, but it is easily controlled by adjusting the bus 
width and cycle time. 

Effects of Bypassing 

Cache line bypassing is the concept of using the instructions
or data as they are loaded into the cache, rather than
waiting for the cache miss to finish. This reduces the pen- 
alty of the cache miss, but it has a few problems, too. In 
the Model 840/Series 930 processor, about three-quarters
of the instruction cache misses occur on the first word of
a line, and the rest are evenly distributed among the other 
three words. The instruction cache penalty on the Model 
840/Series 930 processor is 9 cycles with no bypassing, but 
is from 5 to 8 cycles if bypassing is used. The effective 
miss penalty is calculated by summing 75% of 5, 8% of 6, 
8% of 7, and 8% of 8. This comes out to 5.12, much better
than 9. 

When actually measured, the effective miss penalty is 
5.35, close to the calculated value. Comparing these num- 
bers (still incomplete) at this point gives a CPI contribution 
of 0.16 with bypassing, and 0.27 without bypassing. These 
numbers were calculated by multiplying the I-cache miss 
rate (assumed 3%) by the respective miss penalties. 

Some other things must be considered. If the processor 
is not executing the instructions in the same order and at 
the same speed as they are coming into the cache, the 
analysis breaks down. The Model 840/Series 930 processor 
is capable of doing this, but there are two important excep- 
tions: a processor freeze and a taken branch. If the processor 
freezes for any reason other than the instruction cache miss, 
the instructions coming from memory are now too early 
for the processor. In the case of a branch that is taken, the 
instructions coming from memory simply are not the 
proper instructions. In both cases, the processor is refrozen 
and the instructions from memory are ignored. However, 
this is what happens anyway; there is no additional penalty 
for bypassing. On the Model 840/Series 930 processor, this 
refreeze adds 1.90 cycles to the effective I-cache miss pen- 
alty. This brings the total miss penalty to 7.25. Recalculat- 
ing the CPI, a bypassed I-cache adds 0.22, while no bypass- 
ing adds 0.27. The processor performance gain by using 
bypassing is 0.05 CPI multiplied by 8 raw MIPS, or 0.40 
MIPS, about a 10% gain. 
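The I-cache CPI contributions quoted above follow from multiplying the assumed 3% miss rate by each miss penalty. A minimal sketch of that arithmetic, using the measured penalty and the refreeze cost from the text (the names are illustrative):

```python
ICACHE_MISS_RATE = 0.03        # assumed 3%, as in the text

NO_BYPASS_PENALTY = 9          # cycles, whole line loaded before use
MEASURED_BYPASS_PENALTY = 5.35 # cycles, measured with bypassing
REFREEZE_PENALTY = 1.90        # extra cycles from freezes and taken branches


def cpi_contribution(miss_rate, penalty):
    """CPI added by misses: miss rate times cycles lost per miss."""
    return miss_rate * penalty


no_bypass = cpi_contribution(ICACHE_MISS_RATE, NO_BYPASS_PENALTY)
bypass_ideal = cpi_contribution(ICACHE_MISS_RATE, MEASURED_BYPASS_PENALTY)
bypass_actual = cpi_contribution(
    ICACHE_MISS_RATE, MEASURED_BYPASS_PENALTY + REFREEZE_PENALTY)

print(no_bypass)      # about 0.27
print(bypass_ideal)   # about 0.1605, quoted as 0.16
print(bypass_actual)  # about 0.2175, quoted as 0.22
```

The difference between the first and last figures is the roughly 0.05 CPI gain attributed to bypassing.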

Bypassing is also done on the data cache. Here it is not 
nearly so important. Since the cache can only handle one 
cache operation at a time and the cache operation is not 
finished until the miss has been handled, bypassing only 
works for one access per line. The CPI contributions, calcu- 
lated in a similar way as above (but more complicated) are 
0.10 for bypassing versus 0.13 for no bypassing. Bypassing 
is done on the data cache simply because it was easy to 
implement. The logic existed for the instruction cache 
(where it makes a larger difference), and was easy to mul- 
tiplex between the two caches. 

The following table summarizes the bypass and no- 
bypass performance calculations. 



             no bypass    bypass    difference
             (CPI)        (CPI)     (CPI)    (MIPS)
I-cache      0.27         0.22      0.05     0.40
D-cache      0.13         0.10      0.03     0.24



Bypassing has one other problem. If the bus supports 
retries (a third party requests that the current bus transac- 
tion be ignored and tried again), the retry must be known 
before the first data word is used. Normally this means 
that the retry signal must be present with or before the first 
data word. Because of control complexity, the retry signal 
must be present on the MidBus one cycle before the first
data word in the Model 840/Series 930 processor.

Critical Word First, Line Size, and Cache Size 

Critical word first is an idea that only makes sense with 
cache line bypassing. It is the idea of rearranging the data 
on the transaction so that the needed word comes first, and 
the rest come later. For example, if the second word (word 1) 
is needed first, the memory would return the data in this 
order: word 1, word 2, word 3, then word 0. Taking a look
back at the miss penalty for the instruction cache (5.35), 
we can see that this doesn't make a lot of sense. A miss 
penalty of 5 is the best we could do. Critical word first also 
potentially introduces a penalty on the 32-byte memory 
transactions, a discussion of which is beyond the scope of 
this paper. The Model 840/Series 930 cache and memory 
do not support critical word first. 
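The return order that critical word first would produce is a simple wraparound. The helper below is a hypothetical illustration of that ordering, not something the Model 840/Series 930 hardware implements.

```python
def critical_word_first_order(critical_word, words_per_line=4):
    """Order in which memory would return the words of a line."""
    return [(critical_word + i) % words_per_line
            for i in range(words_per_line)]


print(critical_word_first_order(1))  # [1, 2, 3, 0]
print(critical_word_first_order(0))  # [0, 1, 2, 3], same as the normal order
```

Since about three-quarters of the instruction cache misses already occur on word 0, the reordering would change the return order for only a minority of misses.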

Caches load one line at a time from memory. How big 
should this line be? Typically, the larger the line size, the 
more efficiently it can be loaded from memory. But with 
large lines, it is more likely that words will be loaded that 
will never be used. Since the instruction cache is normally 
used in a regular way, it benefits more from a large line 
size than the data cache. The Model 840/Series 930 processor
uses a line size of 16 bytes (4 words) for both the instruction
and the data caches. A line size of eight words would
have been better for the instruction cache, but would have 
been difficult to implement since the Model 840/Series 930 
has combined, pipelined cache tags. 

Determining how big to build a cache is sometimes dif- 
ficult. The larger the cache, the lower the miss rate will 
be, as long as you stay in the same global address space. 
Some processors do not have a single address space large 
enough to handle multiprocessing. Operating systems can 
get around this problem by flushing the TLB on process 
switches. Sometimes it is necessary to flush the cache, too. 
If the processor must flush the cache on a process switch, 
there comes a point where building larger caches may not 
help, and may even hurt system performance. This would 
put a practical limit on the size of TLBs and caches in such 
systems. The larger they are, the longer they take to flush.
Also, the larger they are, the less likely they are to be fully
used before a process is switched out again. HP Precision 
Architecture solves this problem by having a single large 
address space. All processes share this common address 
space, and no flushing needs to be done on either the TLB 
or the cache during process switches. With HP Precision 
Architecture, larger caches are always higher-performance, 
as long as the cycle time is not affected. 

Floating-Point Coprocessor 

In HP Precision Architecture, floating-point operations 
are handled by a coprocessor. This coprocessor runs con- 
currently with the main CPU and has the sole job of sup- 
porting floating-point arithmetic. Floating-point arithmetic 
is well-suited for a coprocessor because it involves calcu- 
lations that require multiple cycles to perform. This means 
that although the floating-point instruction occupies one 
position in the instruction stream, the main CPU can receive
and execute subsequent non-floating-point instructions
concurrently with subsequent cycles of the floating-point
instruction.

As a true coprocessor, the floating-point board decodes 
its own instructions. It also has its own set of sixteen 64-bit 
registers. Data is sent to the floating-point board by loads 
from the cache and results are retrieved via stores to the 
cache. The floating-point operations are performed accord- 
ing to the IEEE standard for binary floating-point arithme- 
tic. 

The floating-point board is not equipped to handle all 
floating-point operations required, nor all floating-point 
number values. When an unsupported operation or data 
value is detected, the floating-point board indicates this 
fact to the main CPU, which then handles the operation in 
software. The occurrence of these operations requiring soft- 
ware is rare and does not significantly affect performance, 
but does ensure that the IEEE standard is completely satis- 
fied. 

Overlapped Processing 

During the initial investigation for the floating-point 
board, a simple design was considered, which did not allow 
overlap of any type of floating-point instructions. However,
several types of floating-point applications (e.g., matrix
multiply, vector sum, 3x3 graphics transformation, etc.) 
were examined and estimates of performance were made. 
It quickly became apparent that the initial design did not 
meet the performance goals. Furthermore, in most of the 
applications examined, allowing nonconflicting floating- 
point loads and stores to be processed while floating-point 
operations (flops) were being executed increased the per- 
formance enough to meet the goals. Therefore, the floating- 
point coprocessor implements this capability. 

Once the capability of the board was decided, the design 
was implemented in a simple and straightforward manner. 
It was determined that microcode would be used to execute 
flops, while floating-point loads and stores would be implemented
completely in hardware. A proprietary HP floating-point
add, multiply, and divide chip set used in the
HP 9000 Model 550 was selected to do the floating-point
calculations. Also, a special assembler was written to con- 
vert the source microcode into a listing file and a burn file, 
with the latter being used to program the microcode 
PROMs. 

The design choices led to the partitioning of the floating-point
board into two distinct parts, the micromachine and
the state machine (see Fig. 8). These two control pieces 
share a common data bus and both have access to the regis- 
ter file. They share these data paths on a time basis, with 
each in control for half of the 125-ns cycle time. The state 
machine does all of the interfacing with the main CPU and 
provides control for floating-point loads and stores. It also 
dispatches the micromachine to execute flops and signals 
the main CPU when traps or freezes are necessary.

The organization of the floating-point board into these 
two distinct control blocks is well-suited for allowing float- 
ing-point loads and stores to be processed while flops are 
in progress. When the state machine receives a floating- 
point load or store it simultaneously determines whether 
the micromachine is busy, and if so, determines whether
the new load or store conflicts with the flop in progress 
by accessing a floating-point register used in the flop. If
there is no conflict, the state machine processes the floating-point
load or store immediately; otherwise, it waits
until the flop is completed. 
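The decision the state machine makes for each floating-point load or store can be summarized as follows. This is a sketch with hypothetical names; registers are represented simply by their numbers.

```python
def can_process_now(micromachine_busy, flop_registers, load_store_register):
    """Decide whether a floating-point load or store may proceed immediately.

    flop_registers: registers used by the flop in progress (sources and target).
    load_store_register: register accessed by the new load or store.
    """
    if not micromachine_busy:
        return True
    return load_store_register not in flop_registers


# A flop using registers 2, 3, and 4 is in progress.
print(can_process_now(True, {2, 3, 4}, 7))   # True: nonconflicting, overlap it
print(can_process_now(True, {2, 3, 4}, 3))   # False: wait for the flop to finish
print(can_process_now(False, set(), 3))      # True: micromachine is idle
```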

Self-Test 

A full self-test strategy is built into the floating-point 
board design. A special, machine dependent instruction 
was created, which when executed causes the micro- 
machine to perform a detailed diagnostic test of the float- 
ing-point board. This test signals pass or error conditions 
by setting bits in one of the floating-point registers. It also 
puts the one's complement of these bits into another regis- 
ter so that the validity of the information in the first register 
can be verified. Also included in the self-test strategy is
the ability of the main CPU to distinguish between a mal- 
functioning floating-point board and the lack of a floating- 
point board in the system. 
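The check that the second register allows can be sketched as a comparison against the one's complement. The sketch assumes, for illustration, that all 64 bits of each register take part in the check; the actual bit assignments are not given here.

```python
MASK64 = (1 << 64) - 1   # floating-point registers are 64 bits wide


def self_test_result_valid(status, complement):
    """True if `complement` is the one's complement of `status`."""
    return ((status ^ complement) & MASK64) == MASK64


status = 0b1010                       # hypothetical pass/error bits
good_copy = ~status & MASK64          # what a working board would write
bad_copy = good_copy ^ (1 << 17)      # one bit corrupted

print(self_test_result_valid(status, good_copy))  # True
print(self_test_result_valid(status, bad_copy))   # False
```

A board that returns the same value in both registers, such as all zeros or all ones, fails this check, so the result cannot be mistaken for a valid report.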

At power-up, a self-test is run on the floating-point board.
The test includes the microcoded self-test as well as one 
instruction of each type of floating-point operation to en- 
sure that the board is performing correctly. If this is deter- 
mined to be the case, then the system is configured with 
the floating-point board enabled. If this is not the case, an 
error is signaled during initialization. The floating-point 
card can then be removed and the system rebooted without 
the floating-point board enabled. Then all floating-point 
operations are performed in software. 

Development Methods 

Once the floating-point hardware was solid enough to 
run in a system environment, a software test package was 
written to aid in catching numerical errors. This package 
allows us to test any flop with any operands. It also offers 
a pattern mode, during which operands are generated in a 
user-controlled pattern and continually used in flops. It 
functions by checking results received from the floating- 
point hardware with results derived from software floating- 
point routines. During development, if discrepancies were 
found between the two results, the operands, operation, 
and results were logged to an error file. In the pattern mode
of operation, this software package caught a few numerical 
flaws both in the floating-point hardware and in the soft- 
ware floating-point routines. The mistakes in the software 
floating-point routines were equally important to correct
because in the absence of hardware, they would be used
to do the calculations. 
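The comparison approach used by the test package can be illustrated with a small harness. Everything here is a stand-in: the "hardware" function is a deliberately flawed software routine, and the operand pattern is a simple cross product. Results are compared as raw bit patterns so that a wrong sign on a zero result is caught.

```python
import itertools
import struct


def bits(x):
    """Raw 64-bit pattern of a double, so +0.0 and -0.0 compare as different."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def compare_flops(hardware_op, software_op, operand_pairs):
    """Return a log of (a, b, hardware result, software result) mismatches."""
    errors = []
    for a, b in operand_pairs:
        hw, sw = hardware_op(a, b), software_op(a, b)
        if bits(hw) != bits(sw):
            errors.append((a, b, hw, sw))
    return errors


def software_add(a, b):
    return a + b


def flawed_hardware_add(a, b):
    # Stand-in for a hardware flaw: a zero result always comes back positive.
    result = a + b
    return 0.0 if result == 0.0 else result


pattern = itertools.product([0.0, -0.0, 1.5], repeat=2)
errors = compare_flops(flawed_hardware_add, software_add, pattern)
print(len(errors))   # 1: only (-0.0) + (-0.0) disagrees
```

Because neither side is assumed to be correct, a logged discrepancy can point to a flaw in either implementation, which is how errors in the software routines were also found.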

To get the maximum performance from the floating-point 
board, the compiler groups and especially those connected 
with the optimizer were brought into the design process 
in the early stages. They were given information on how
long each flop took to execute and what constitutes a con- 
flicting load or store versus a nonconflicting load or store. 
Using this information along with their knowledge of the 
operating characteristics of the main CPU, they altered their 
compilers and optimizers so that the instruction ordering 
and floating-point register allocation they generated would 
take the best advantage of the floating-point hardware that 
existed. 

The inclusion of the floating-point coprocessor with the 
main CPU allows the hardware to be used in a technical 
environment. Its position in the system as a coprocessor 
enhances the overall system performance by allowing the 
CPU to do non-floating-point activities such as address 
generation and integer arithmetic while multicycle flops 
are in progress. The floating-point board's ability to do 
floating-point loads and stores concurrently with flops 
means that while one result is being calculated, previous 
results can be stored and operands for future flops can be 
loaded. In summary, the design of the floating-point board, 
its placement in the system, and its influence on the com- 
pilers and optimizers all serve to get the maximum techni- 
cal performance from the technology used to implement 
the design. 

Memory System 

The Model 840/Series 930 memory system is designed 
to maximize performance by keeping the latency from ad- 
dress to first word as small as possible. All signals were 
potential critical path signals and had to be analyzed care- 
fully to ensure that the timing goals were met. The memory 
strobe lines RAS and CAS were closely analyzed so that the 
skew was minimized. 

The DRAMs are accessed using nibble mode so that a 
read operation can return a word of data every 125 ns after 
a latency period of 300 ns. The memory controller is im- 
plemented using TTL technology. 

The memory system communicates with the processor 
and the I/O system through the MidBus, which is a synchronous 
high-speed bus. There is parity checking on the MidBus 
for the address, data, and control lines. The memory 
generates and checks parity on data reads and data writes 
to improve the reliability of the memory system. 

The memory can be accessed in either 16-byte or 32-byte 
transactions. The 16-byte transaction takes seven cycles 
and the 32-byte transaction requires 13 cycles. The 
maximum memory bandwidth for 16-byte transactions is 
18.285 Mbytes/s and for 32-byte transactions is 19.7 
Mbytes/s. 
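
The bandwidth figures above can be reproduced with a few lines of arithmetic. The sketch below is in Python and assumes a 125-ns bus cycle and 1 Mbyte = 10^6 bytes; both assumptions are ours, chosen because they reproduce the quoted numbers. The function name is illustrative only.

```python
CYCLE_NS = 125  # assumed bus cycle time in nanoseconds


def bandwidth_mbytes_per_s(bytes_per_transaction, cycles):
    """Peak bandwidth for back-to-back transactions, in Mbytes/s (10**6 bytes)."""
    transaction_ns = cycles * CYCLE_NS
    # bytes per nanosecond times 1000 equals Mbytes per second
    return bytes_per_transaction * 1000 / transaction_ns


print(bandwidth_mbytes_per_s(16, 7))   # 16-byte transaction, seven cycles
print(bandwidth_mbytes_per_s(32, 13))  # 32-byte transaction, 13 cycles
```

The longer transaction gains only slightly because most of each transaction is spent transferring data rather than waiting for the first word.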

At power-up it is necessary to initialize the memory con- 
troller. The architected control and status registers are vis- 
ible to the software in the I/O address space. The boot code 
initializes the memory controller, setting up the physical 
memory's address range via MidBus I/O transactions to the 
I/O registers resident on the controller. 



Each 8M-byte main memory module is physically located 
on two boards. The memory controller board contains three 
banks of DRAMs and the memory array contains five banks 
of DRAMs. One memory controller communicates with one 
memory array card. While this increases the manufacturing 
price compared to a product that extends the reach of the 
controller to many memory cards, it has the advantage of 
reducing latency, since fewer DRAMs are addressed and 
the address and data buses are shorter. 

27 bits of the 32-bit physical address are used. This limits 
addressability to a 128M-byte physical space. 

Error Correction 

Error-correcting memory is standard. A 32-bit error-de- 
tection and correction (EDC) chip forms the basis of this 
circuitry. During a memory write operation, 32 bits of data 
are sent through the EDC logic, which generates seven 
checkbits. These are merged with 32 bits of data in the 
proper RAM bank. When a memory read operation occurs, 
these 39 bits are sent through the EDC logic, which inter- 
nally regenerates what the seven checkbits should be and 
compares them to the checkbits that it actually got from 
the RAM bank. The result of this comparison is called a 
syndrome. The checkbits for each 32-bit pattern are chosen 
so that the syndrome reveals useful information about any 
errors that are detected. If the error is a single-bit error, the 
syndrome can be decoded to see which bit is wrong. The 
EDC does this and corrects the error. Multiple-bit errors 
are not correctable. The best that can be done is to detect 
their presence and interrupt the processor by pulling on 
the error signal on the MidBus. 
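
The write, read, and correct cycle described above can be illustrated with a textbook extended Hamming code. The article does not say which code the EDC chip implements, so the sketch below is only one possible scheme with the same shape: 32 data bits plus seven checkbits, single-bit correction, and double-bit detection. The bit layout (checkbits at power-of-two positions, an overall parity bit at position 0) and all names are assumptions.

```python
# Codeword bit positions 1..38 hold a Hamming code; position 0 is overall parity.
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32]                      # six Hamming checkbits
DATA_POSITIONS = [p for p in range(1, 39) if p & (p - 1)]   # the other 32 positions


def _position_xor(word):
    """XOR together the positions (1..38) that currently hold a 1."""
    result = 0
    for p in range(1, 39):
        if (word >> p) & 1:
            result ^= p
    return result


def encode(data):
    """Return a 39-bit codeword for a 32-bit data value."""
    assert 0 <= data < 1 << 32
    word = 0
    for i, p in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << p
    # Choose the six checkbits so that the position XOR becomes zero.
    s = _position_xor(word)
    for c in CHECK_POSITIONS:
        if s & c:
            word |= 1 << c
    # Seventh checkbit: make the total number of ones even.
    if bin(word).count("1") % 2:
        word |= 1
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = _position_xor(word)
    odd_parity = bin(word).count("1") % 2 == 1
    if odd_parity:
        if syndrome > 38:                 # cannot be a single-bit error
            return None, "uncorrectable"
        word ^= 1 << syndrome             # syndrome names the bad bit (0 = parity bit)
        status = "corrected"
    elif syndrome:
        return None, "uncorrectable"      # even parity, nonzero syndrome: two bits wrong
    else:
        status = "ok"
    data = 0
    for i, p in enumerate(DATA_POSITIONS):
        if (word >> p) & 1:
            data |= 1 << i
    return data, status
```

In this sketch the syndrome is the position of the failing bit, which is what makes single-bit correction a simple bit flip.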

System Monitor Module 

The power system consists of several major components: 

■ Ac front-end power distribution unit 

■ 5-kVA isolation transformer 

■ Fan tray with four ac fans and backup battery 

■ Three 300W power supplies 

■ System monitor module 

■ Internal and external control panels. 

The relationship among these components is represented 
in the system block diagram, Fig. 9. The system monitor 
module serves as an interface between the power supply 
and the SPU boards. It generates secondary power (+5V S1 
and +5V S2) to the CPU and memory boards, processes 
power-on and powerfail warning signals to the MidBus, 
terminates and arbitrates MidBus signals, monitors system 
temperature, and interfaces with the control panel and access 
port. It also includes some miscellaneous processor 
dependent hardware consisting of the time-of-day clock, 
stable storage, diagnostic switches, and an EPROM for pro- 
cessor dependent code. 

The secondary power is generated by two dc-to-dc con- 
verters. In the normal mode of operation, the system 
monitor module converts +28V from two of the power 
supplies to +5.1V to supply up to six memory boards, or 
24M bytes of memory. During the backup mode, the dc-to- 
dc converters take power from the 10V 10A backup battery. 
The battery can supply up to six memory boards for at least 
15 minutes during a powerfail. An external battery connector 
is provided if additional backup time is required. The 
battery enable switch is built into the ac circuit breaker on 
the power distribution unit. Another battery test switch on 
the power distribution unit can bypass the battery enable 
switch for powerfail testing. 

The system temperature is monitored by four thermistors 
on the backplane. When an overtemperature situation oc- 
curs, the system monitor module turns on the yellow warn- 
ing light and sets a flag to the CPU at 45°C inside the 
cabinet. At 60°C, the system monitor module trips the ac 
circuit breaker and shuts off all the power including battery 
backup. 
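
The two-threshold behavior can be summarized in a few lines. This Python sketch assumes that the hottest of the four thermistors decides the action and that each threshold is inclusive; the article states neither, and the function name is hypothetical.

```python
WARN_C = 45      # yellow warning light and CPU flag
SHUTDOWN_C = 60  # ac circuit breaker tripped, all power off


def monitor_action(thermistor_readings_c):
    """Map the backplane thermistor readings (degrees C) to an action."""
    hottest = max(thermistor_readings_c)  # assumption: worst reading decides
    if hottest >= SHUTDOWN_C:
        return "shutdown"
    if hottest >= WARN_C:
        return "warn"
    return "normal"
```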

The processor dependent hardware on the system 
monitor module communicates with the CPU via diagnos- 
tic instructions. The diagnostic instructions have access to 
the control-panel hexadecimal status displays, the time-of- 
day clock chip, the stable-storage CMOS RAM, and the 
diagnostic switches. The software can read from and write 
to these components via entry points in the processor de- 
pendent code. The clock chip and the stable-storage RAM 
are backed up by two lithium batteries on the system 
monitor module during powerfail. A parallel-to-serial in- 
terface converts 16 bits of hexadecimal display data serially 
to the access port to facilitate remote diagnosis. 

An Automated Test System for the First 
HP Precision Architecture Computers 

Besides testing for proper operation, the system gathers 
specific failure information and generates summary 
statistics to be used in improving the manufacturing 
process. 

by Thomas B. Wylegala, Long C. Chow, and Randy J. Teegarden 



THE AUTOMATED TEST SYSTEM for the first com- 
puters of the HP Precision Architecture family can 
test up to ten HP 9000 Model 840 or HP 3000 Series 
930 Computers simultaneously. Fig. 1 is a block diagram 
of the test system. 

A Model 840/Series 930 Computer configured with two 
special boards can be connected to the test system via a 
cable. The test system then has the ability to load diagnostic 
programs into the Model 840/Series 930 and monitor the 
results of those tests. The host for the test system is an HP 
9000 Model 220, but any HP 9000 machine that runs the 
HP-UX 5.1 operating system could serve as well. 

Key Features 

The Model 840/Series 930 computer contains 32K bytes 
of test code resident in ROM. This self-test code is executed 
whenever the computer is powered on or reset. There is a 
need to supplement this test code with additional specialized 
test programs. Also, it is expensive to modify firmware 
based code, but easy to add a new test to the test system. 
Therefore, the test system provides the capability to down- 
load test programs into the memory of the computer under 
test and to initiate their execution. 

The test system monitors the results of test execution 
and writes the status to a log file. This eliminates the need 
to have a human operator constantly observing the unit 
under test to judge whether the unit has passed. The test 
system collects the data elements that are critical to the 
success of the quality control program. 

Little peripheral equipment is required to support the 
testing process. Without the test system, the minimum con- 
figuration to run diagnostics on Model 840/Series 930 pro- 
cessors includes a console and a disc for each unit under 
test. The peripherals for the testing of ten units would be 
expensive and would consume valuable factory space. 

Fig. 1. Block diagram of the test system for HP 9000 Model 840 
and HP 3000 Series 930 processors. 

Finally, the test system requires minimal cooperation 
from the unit under test. Even catastrophic self-test failures 
can be detected properly by the test system. It can even 
download test programs into a unit whose normal input/output 
channels are inoperable. 



Hardware Developed 

The communications interface between the test system 
and the unit under test is based on the link used to support 
the remote debugger (RDB),1 so the test system is fully 
compatible with RDB. That link involves a 16-bit path between 
a GPIO card in the HP 9000 host and a parallel card 
in the Model 840/Series 930 under test. For the test system, 
two custom boards (called the test system interface and 
the GPIOA adapter) were designed to enhance the RDB 
link. The connections are shown in Fig. 2. 

Many advantages accrue from inserting these two custom 
cards into the path. Signals can be transmitted reliably over 
distances up to 300 feet (versus 3 feet for the original link) 
using an EIA RS-422 differential interface. The system will 
work with both the new differential drive and the original 
single-ended drive versions of the parallel card. The test 
system interface also has access to signals on the access 
port card slot. These signals allow the test system to give 
the unit a hard reset and to receive the 16 bits of serial 
data sent to the access port card. These 16 bits of data, 
which on Model 840/Series 930 processors also appear on 
the LED status display, enable the unit under test to send 
messages to the test system even if a failure in the CIO 
channel adapter prevents communication via the console. 
Finally, the test system can determine when the unit under 
test has been powered on or off. 

Firmware Developed 

The I/O dependent code (IODC) 2 for the parallel card 
was written to make communications with the test system 
possible. The IODC contains a communications server 
which enables the Model 840/Series 930 to interpret the 
data being transmitted to it by the test system through the 
parallel card. When the boot information in the Model 
840/Series 930's stable storage area indicates that a parallel 
card is to serve as the boot device, the communications 
server is launched after the self-test has completed. 

Software Developed 

The test system control software has a multiprocess struc- 
ture. The main process, which is scheduled when the test 
system is initiated, contains the user interface. It maintains 
the status windows, updates the softkeys, and executes 
user commands. There is one p_monitor process for each 
computer under test. These processes manage the com- 
munications with the units under test. Finally, there is a 
background process that performs periodic and intermit- 
tent tasks. The test system control software allows easy 
reconfiguration of each test station. Parameters that can be 
configured include the list of tests to be executed and the 
number of temperature cycles needed for test completion. 

The test system allows diagnostic programs to be loaded 
into the computers under test. Developing these diagnostics 
was another challenge. Several sets of test programs were 
written. For example, there is an exerciser for the transla- 
tion lookaside buffer (TLB) that verifies the proper opera- 
tion of each field in every entry of the TLB. A total of 49 
of the original architecture verification programs were 
adapted for use with the test system. These programs per- 
form extensive testing of the arithmetic, logical, and branch 
instructions. The test programs obey some conventions to 
make the test system's job easier. For instance, the programs 
use the access port interface path to transmit error codes 
to the test system to indicate failures or a pass code to 
indicate when the test has completed successfully. All of 
the tests written for the Model 840/Series 930 use the data 
FFFF in hexadecimal to signal success and preface a failure 
message with the data DEAD. The Model 840/Series 930 
tests also compute a checksum on their instruction text to 
ensure that the program was downloaded correctly. 
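
The pass and fail conventions lend themselves to a small decoder on the test system side. The Python sketch below assumes that results arrive as a sequence of 16-bit words and that a failure message is the DEAD prefix followed by one or more error-code words; the framing, the function names, and the additive checksum are illustrative assumptions, since the article specifies only the FFFF and DEAD values.

```python
PASS_CODE = 0xFFFF    # from the article: signals success
FAIL_PREFIX = 0xDEAD  # from the article: prefaces a failure message


def interpret(words):
    """Classify the 16-bit words received from a unit under test."""
    words = list(words)
    if not words:
        return "no_result", []
    if words[0] == PASS_CODE:
        return "pass", []
    if words[0] == FAIL_PREFIX:
        return "fail", words[1:]   # remaining words carry the error code
    return "unknown", words


def checksum16(program_text: bytes) -> int:
    """Hypothetical checksum: sum of the bytes modulo 2**16."""
    return sum(program_text) & 0xFFFF
```

A downloaded test would compare its own computed checksum with a value supplied at build time and report a failure on mismatch, before running any test logic.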

Relation to the Manufacturing Process 

Model 840/Series 930 boards are subjected to three levels 
of testing during their manufacture. First is a board-level 
in-circuit test on the HP 3065 Board Test System. 3 This 
test screens out most process related problems, such as 
bent pins and solder bridges. Next, boards are grouped into 
sets, installed into backplanes, and subjected to functional 
tests in a temperature chamber. The temperature chamber 
continuously cycles from 0 to 55°C with a period of two 
hours. This test detects temperature-sensitive component 
failures and precipitates failures in marginal parts. The 
final step is to perform an operating system level verifica- 
tion test on the completely assembled unit. Verification 
test suites have been written for both the HP-UX and MPE 
XL operating systems. 

The test system plays its biggest role in the temperature 
chamber phase of testing. Model 840/Series 930 boards 
spend between ten and twenty-four hours in the chamber. 
During this time, the test system supervises the execution 
of the prescribed set of test programs. 

The temperature chamber test sequence has six phases. 

1. The operator installs a board set into a backplane in the 
chamber and applies power to the unit. The test system 
automatically detects that the unit has powered on and 
spawns a p_monitor process for it. 

2. The test system sends the unit a hard reset signal. The 
unit will perform its self-test and initiate the boot sequence. 

3. The test system waits until the unit signals successful 
completion of the boot from the parallel card. Anything 
other than a boot completed message is considered an error. 

4. The test system gets the next test to run from the test 
list. (The list is circular; the first test is rerun after the last 
test completes.) The test system downloads the test and 
directs the unit to execute it. 

5. The test system allows time for the test to execute and 
checks on the result. If the unit passed, the test system 
continues with step 2. 

6. In the event of an error, the test system records the error 
code, signals the operator that a failure has occurred, and 
attempts to rerun the test that failed. The system also gen- 
erates a failure record to be added to the data base. 
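
Steps 2 through 6 above form a loop that can be sketched as follows. This Python sketch is a simplification under stated assumptions: the unit interface (hard_reset, wait_for_boot, download_and_run) is hypothetical, a failed test is retried after another reset, and the loop is bounded by a run count rather than by temperature cycles. FakeUnit is a stand-in used only to exercise the loop.

```python
from itertools import cycle

PASS = 0xFFFF  # pass code used by the Model 840/Series 930 tests


def run_chamber_tests(unit, test_list, max_runs):
    """Drive one unit through steps 2-6 for max_runs test runs; return the failure log."""
    failures = []
    tests = cycle(test_list)              # step 4: the test list is circular
    test = next(tests)
    for _ in range(max_runs):
        unit.hard_reset()                 # step 2
        if not unit.wait_for_boot():      # step 3: anything else is an error
            failures.append(("boot", None))
            continue
        code = unit.download_and_run(test)    # steps 4 and 5
        if code == PASS:
            test = next(tests)            # move on to the next test
        else:
            failures.append((test, code))     # step 6: log, then rerun the same test
    return failures


class FakeUnit:
    """Stand-in for a unit under test that fails one named test once."""

    def __init__(self, flaky_test):
        self.flaky_test = flaky_test
        self.already_failed = False
        self.history = []

    def hard_reset(self):
        pass

    def wait_for_boot(self):
        return True

    def download_and_run(self, test):
        self.history.append(test)
        if test == self.flaky_test and not self.already_failed:
            self.already_failed = True
            return 0x0042                 # arbitrary error code for illustration
        return PASS
```

In the real system each unit has its own p_monitor process, so a loop of this kind would run once per unit under test.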

The test system is also a valuable aid in repairing defec- 
tive boards. Repair technicians have access to the same set 
of diagnostic programs that first indicated the presence of 
the failure. The technicians can query the test system data 
base to determine the exact circumstances of the failure 
and any prior attempts at repair. This obviates the need 
to have paper tracking forms accompany the boards to the 
repair area. 

To integrate the entire manufacturing operation, the test 
system is also used in the final configuration area. The full 
set of diagnostics is available, and the data base can be 
accessed. The test system also provides special utilities to 
initialize the units before shipment. 

Results Observed 

Before installation of the test system, the only test pro- 
grams that the Model 840/Series 930 could execute in the 
temperature chamber were those contained in ROM, that 
is, the power-on self-test. The test system supplements the 
self-test with a variety of other tests. Moreover, the test 
system forces the Model 840/Series 930 to complete the 
entire boot sequence, in itself a good test. The effectiveness 
of the tests can be measured by the failure rate experienced 
in the next level of testing, that is, the operating system 
tests. In the fourteen-week period before installation of the 
test system, the failure rate experienced during the operating 
system tests was 19%. That is, of 228 sets of Model 
840/Series 930 boards that successfully executed self-test 
in the chamber, 44 were subsequently shown to have 
hardware failures by the next level of testing. By contrast, 
during the first seven weeks in which the test system was 
operational, the failure rate dropped to 2.5% (only one 
failure out of 41 sets tested). 
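
The quoted rates can be checked directly from the counts. The short Python sketch below does the division; the function name is ours. The second figure works out to slightly under the 2.5% quoted, which suggests rounding in the text.

```python
def failure_rate_percent(failures, sets_tested):
    """Share of board sets that failed the next test level, in percent."""
    return 100.0 * failures / sets_tested


before = failure_rate_percent(44, 228)  # fourteen weeks before the test system
after = failure_rate_percent(1, 41)     # first seven weeks with the test system
print(f"before: {before:.1f}%  after: {after:.1f}%")
```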

Summary 

The test system is designed to fill two critical needs in 
computer manufacturing. The first is to subject the comput- 
ers to stringent functional tests so that defects can be re- 
vealed at the earliest possible time. The second is to keep 
accurate records of the results of those tests so that the 
manufacturing process can be constantly improved. With 
respect to the first point, the test system has already been 
a great success. A number of failures were detected by 
testing in the temperature chamber that would otherwise 
have gone unnoticed until the operating system tests. 
Moreover, whenever better diagnostics are written, the test 
system will be ready to download them. The test system 
automatically generates a template for each failure occur- 
rence; the operator need only input the serial number of 
the board that was defective. This data base can be scanned 
to reveal the weak points of the process, 

A Distributed Terminal Controller for 
HP Precision Architecture Computers 
Running the MPE XL Operating System 

Up to 48 terminals or printers connected to each controller 
communicate with HP 3000 Series 930 or 950 Computers 
over an IEEE 802.3 local area network. 

by Gregory F. Buchanan, Francois Gaullier, Olivier Krumeich, Eric Lecesne, Jean-Pierre Picq, 
and Heng V. Te 



WITH THE EVER INCREASING capacity of com- 
puter systems comes the demand for more termi- 
nal connections. On the other hand, SPU (system 
processing unit) cabinets are becoming smaller with each 
generation. Finding space in these smaller SPU cabinets 
for the increasing number of terminal connections is a 
challenge. 

For HP Precision Architecture computer systems using 
the commercial operating system, MPE XL, the solution is 
the HP 2345A Distributed Terminal Controller (DTC). The 
approach taken was to move the terminal connections out 
of the SPU cabinet and into the DTC. 

The HP 2345A is designed for the HP 3000 Series 930 
and Series 950 Computers. It enables up to 48 asynchronous 
devices (terminals or serial printers) to be connected to 
these systems over an IEEE 802.3 local area network (LAN), 
thereby greatly simplifying the cabling and lowering the 
associated costs. Future releases will allow a terminal user 
connected to the DTC to establish a session with an MPE 
XL system, and then by a simple command, to switch to 
another MPE XL system on the same LAN. This switching 
capability, combined with the possibility of distributing 
the DTC in a building, is a major contribution of this new 
commercial computer system family. 

Software Architecture 

Fig. 1 shows the overall DTC software architecture. To 
achieve the objectives of performance and simplicity, the 
DTC has a special operating system, AOS, which is a 
straightforward dispatcher with associated services like 
memory and timer management. An added benefit of this 
dedicated operating system was easy integration of the DTC 
software into a Pascal workstation along with debugging 
tools. This approach proved very efficient in the design, 
testing, and integration phases. 

The stack of protocols in the DTC has been reduced to 
a minimum, and some layers of the ISO OSI model have 
been combined. Layers one and two are the standard IEEE 
802.3 and IEEE 802.2 Class I. These standard protocols 
were chosen so that the DTC could share the LAN (the 
same physical cable, medium attachment unit, and LAN 
controller) with HP 3000 Network Services (NS/3000) or 
any IEEE 802.3 compatible devices. 

For the upper OSI layers, no protocols existed that met 
the objectives of performance and simplicity, so proprietary 
protocols are used instead of standard protocols such as 
TCP/IP and TELNET. It was also felt that additional func- 
tionality was needed in the standard protocols to ensure 
satisfactory support of terminals in the MPE environment. 
A goal was to offload the character-oriented tasks, like 
character backspace or line delete processing, from the host 
and do it in the DTC to save processing power in the host 
and at the same time provide real-time feedback to the 
user's keystrokes at the terminal. This results in greater 
overall system efficiency and more friendliness for the customer, 
but it also means that the DTC needs a lot of intelligence 
and an elaborate software structure (Fig. 1) to organize 
this processing. 

Fig. 1. Distributed terminal controller (DTC) software structure. 
AOS is the DTC operating system. FCP is the flow control 
protocol. DCP is the device control protocol. NMP and RMP 
are network management and remote maintenance protocols. 

It was also apparent that some sort of transport mecha- 
nism is necessary to prevent flooding the DTC with data. 
Since the processing power of the host is at least an order 
of magnitude greater than the processing power of the DTC, 
the DTC needs some way to tell the host to stop sending 
data so the DTC can process what it has received. This 
functionality is also needed because a user at a terminal 
can always stop the flow of data and resume later, a printer 
can run out of paper, and so on. This requirement resulted 
in the definition of a special DTC flow control protocol. 

Flow Control Protocol 

As implied by the name, the most important feature of 
this protocol is to make sure that the DTC is not overflowed 
by the host. There are other protocols that provide this 
feature, but no single protocol that offers all of the features 
that are needed. The DTC flow control protocol (FCP) is 
based on a study done by Hewlett-Packard Laboratories in 
1984 of a protocol known as Fast Path, a higher-perfor- 
mance, stripped-down version of the ARPA TCP/IP pro- 
tocols. The DTC FCP is derived from Fast Path and from 
CCITT Recommendation X.25 Level III. 

The features of the DTC FCP are: 

■ Simplicity 

■ Flow control 

■ Reliability 

■ Connection-oriented 

■ Connection assurance 

■ Fragmentation and reassembly. 

To maintain DTC efficiency, the FCP has been kept very 
simple. Since it is layered on top of IEEE 802.3, which 
provides error detection, error detection is not part of the 
FCP; only error recovery is needed. The FCP was also de- 
signed knowing that errors on a LAN are infrequent, so the 
mainstream functions are optimized while the error recov- 
ery processing is more cumbersome. 

The FCP is a reliable transport because it guarantees that 
all packets of data are delivered in the same order they 
were received, and that no packets are lost or duplicated. 
Reliable transport is needed because control information 
is exchanged between the DTC and the host, and algorithms 
that manage that exchange are simpler if they are based on 
a reliable transport protocol. 

The FCP is a connection-oriented protocol. There is a 
simple procedure for establishing and breaking connec- 
tions between the DTC and the host. With this scheme, 
each of the 48 ports of the DTC has its own connection for 
exchanging data and control information without interfer- 
ing with the traffic on the other ports. This feature will be 
very helpful in the future to provide access to more than 
one host from the same port of a DTC. 

To ensure reliable exchange of information on a port 
basis, a connection assurance mechanism was defined so 
that each end, the DTC or the host, can make sure that the 
other entity is still alive and functioning. 

While MPE does not limit the size of a single write, the 
request has to be passed over the LAN, which restricts the 
size of the frames to 1518 bytes. Therefore, the FCP provides 
a way to fragment and reassemble the user data. The more- 
data bit provides this functionality. 
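
Fragmentation with a more-data flag can be sketched as follows. This Python sketch models each fragment as a (more_data, payload) pair; the real FCP header layout and the payload size left after the LAN and FCP headers are not given here, so max_payload is a caller-supplied assumption and the function names are illustrative.

```python
def fragment(data: bytes, max_payload: int):
    """Split one write into (more_data, chunk) pairs; only the last has more_data False."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    chunks = [data[i:i + max_payload] for i in range(0, len(data), max_payload)] or [b""]
    return [(i < len(chunks) - 1, chunk) for i, chunk in enumerate(chunks)]


def reassemble(fragments):
    """Rebuild complete writes from in-order fragments."""
    writes = []
    buffer = bytearray()
    for more_data, chunk in fragments:
        buffer += chunk
        if not more_data:              # last fragment of this write
            writes.append(bytes(buffer))
            buffer.clear()
    return writes
```

Because the FCP delivers packets in order and without loss or duplication, the receiver needs no per-fragment offsets; a single bit is enough to mark the end of a write.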

The protocol is fully symmetrical. There is no master/ 
slave relationship. 

FCP Operation 

To ease the implementation, the header of an FCP packet 
is of fixed length, as shown in Fig. 2. Seven types of packets 
are used to establish and break connections, to ensure re- 
liable transport of information, and to provide connection 
assurance. To offer sustained throughput without a request/ 
reply scheme, which is less efficient under heavy load, a 
window mechanism is used. 



Packet types: 
1 pkdata: normal data 
2 pkack: ack or status reply 
3 pkrstat: request status 
4 pkcreq: connection request 
5 pkcreply: connection reply 
6 pkabort: abort/disconnect 
7 pknak: negative acknowledgment 

Fig. 2. DTC flow control protocol packet format. 

Fig. 3 shows how flow control is performed using the 
window mechanism. Three fields are used in this al- 
gorithm: 1) the sequence number field, which identifies 
the packet sent, 2) the acknowledge number, which indi- 
cates the next sequence number that the receiver is waiting 
for, signaling to the sender that all packets sent previously 
have been received, and 3) the window field, which indi- 
cates the last sequence number the receiving end is willing 
to accept. In the example of Fig. 3, the left side opens a 
window for three packets at the beginning, and the window 
closes automatically when the right side sends seq#10. The 
left side does not reopen the window with ack#11, win#10, 
but only confirms the window number and the last sequence 
number received. When memory becomes available, 
the left side reopens the window for three more packets. 
In the meantime, the right side waits for the window 
to reopen before sending any more data. 

Fig. 3. Flow control mechanism window field. 
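
The interplay of the sequence, acknowledge, and window fields can be sketched in a few lines. This Python sketch uses illustrative sequence numbers, ignores sequence-number wraparound and retransmission, and models receiver memory as a simple count of free buffers; the class and method names are assumptions.

```python
class Receiver:
    """The side that opens and closes the window."""

    def __init__(self, first_seq, buffers):
        self.next_expected = first_seq
        self.free = buffers

    def advertise(self):
        """Return (ack, win): next sequence number expected, last one acceptable."""
        return self.next_expected, self.next_expected + self.free - 1

    def accept(self, seq):
        assert seq == self.next_expected and self.free > 0
        self.next_expected += 1
        self.free -= 1

    def release(self, count):
        """Memory has become available again."""
        self.free += count


class Sender:
    """The side that must respect the advertised window."""

    def __init__(self, first_seq):
        self.next_seq = first_seq
        self.window = first_seq - 1    # closed until the peer opens it

    def update(self, ack, win):
        self.window = win

    def can_send(self):
        return self.next_seq <= self.window

    def send(self):
        assert self.can_send()
        seq = self.next_seq
        self.next_seq += 1
        return seq
```

With a receiver that starts at sequence number 8 with three free buffers, the sender can send 8, 9, and 10. The receiver then advertises ack 11 with the window still at 10, which confirms receipt without reopening the window, and the sender stays blocked until release is called.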

Device Control Protocol 

As mentioned above, the DTC performs all the byte- 
oriented tasks. The host controls the behavior of the DTC 
using the DTC device control protocol (DCP). Fig. 4 gives 
an overview of DCP functionality. Similar protocols, for 
example CCITT Recommendation X.29 and the virtual ter- 
minal protocol used by HP network services products,' 
already existed but either could not be used for perfor- 
mance reasons or would have had to be modified so exten- 
sively that it seemed easier to define a new protocol. The 
DCP is designed to be simple, efficient, and compatible 
with the HP CIO (channel input/output) bus. 



Fig. 4. DTC device control protocol overview. 



Simplicity is achieved through a small number of mes- 
sage types and a fixed-length header. Fig. 5 shows the mes- 
sage types defined by the DCP. It is fairly easy to decode 
a request since the format is fixed and relevant information 
(request code, data length, etc.) is always in the same place. 
Also, when changing the value of a parameter to alter the 
behavior of the DTC, the parameter can always be found 
at the same place in the packet. 

Because the protocol is compatible with the CIO bus, the 
same pieces of code can be shared to control devices, either 
over the internal bus of the host or remotely over a network. 
Sharing the same code and using the same protocol not 
only save resources, but also ensure that a terminal will 
always work the same way, regardless of the physical connection 
to the system. This important feature will guarantee 
that in the future, even if new ways of connecting terminals 
are invented, the functionality and the behavior of those 
terminals will be the same, thus greatly simplifying migration 
to the new means of connection. 

Fig. 5. Device control protocol message types. 

The device control protocol is implemented on a multi- 
plexer board to use the hardware resources, such as micro- 
processor power and RAM space, efficiently. One example 
of this efficiency is in the design of the reply sent by the 
multiplexer card. The reply has a trailer instead of a header. 
This allows the multiplexer to start sending data before 
the request is fully completed, thus saving RAM space. 

The DTC device control protocol is not symmetrical, the 
host being the master and the DTC being the slave. 

The host configures a port of the DTC using the write 
port configuration message. Fig. 6 shows the format of this 
message. It contains all the configuration information for 
read, write, and modem control operations. It also specifies 
the speed and parity for the port's operation (data bytes 6 
and 7). All MPE intrinsics are mapped into a set of values 
in this message, allowing the user program full control of 
the terminal. 

The DTC returns user data from a read request with the 
message shown in Fig. 7. The format of this message allows 
the DTC to forward data to the host as soon as it is received 
from the terminal. This greatly reduces the amount of mem- 
ory required to handle a read request, since there is no 
need to buffer the complete message before sending it. In 
addition, the protocol uses the fact that the LAN and the 
host can take data from the DTC faster than the terminal 
can deliver it. When the read is completed, the DTC sends 
the completion code, along with the time of the read, to 
the host in a fixed-length trailer in the last eight bytes of 
the complete message. 
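
On the host side, a reply that ends in a trailer is handled by peeling the last eight bytes off the complete message. The Python sketch below shows the idea. Only the eight-byte trailer length comes from the text; the internal layout assumed here (a 32-bit completion code followed by a 32-bit read time, big-endian) and the function names are hypothetical.

```python
import struct

TRAILER_LEN = 8  # from the article: fixed-length trailer in the last eight bytes


def split_reply(message: bytes):
    """Separate the user data from the fixed-length trailer."""
    if len(message) < TRAILER_LEN:
        raise ValueError("message shorter than trailer")
    return message[:-TRAILER_LEN], message[-TRAILER_LEN:]


def parse_trailer(trailer: bytes):
    """Hypothetical layout: completion code, then read time, both 32-bit big-endian."""
    completion_code, read_time = struct.unpack(">II", trailer)
    return completion_code, read_time
```

Placing these fields at the end is what lets the DTC forward data before it knows how the read will complete.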

The asynchronous event message allows the DTC to get 
attention from the host, or to signal that conditions enabled 
through the write port configuration message have occurred. 
Examples of asynchronous events are the establishment 
of a modem connection, a user calling for attention 
(break or subsystem break), or a modem disconnection. 

Fig. 6. Device control protocol write port configuration 
message. 



Network Management and Remote Maintenance 
Protocols 

The DTC can be looked at from two points of view. In 
the first, the DTC is treated as a local terminal multiplexer, 
except that it is connected to the SPU by an extended bus. 
Like a local multiplexer, the DTC belongs to a single SPU. 
The difference in managing the DTC is that commands are 
sent on the bus instead of the SPU having direct access to 
the multiplexer hardware. That the bus is an IEEE 802.3 
LAN is irrelevant. It could just as easily be a fiber optic 
link, or some type of point-to-point connection. 

From the second point of view, the DTC is treated as a 
service provider on a LAN. The service it provides is termi- 
nal and printer access to MPE XL systems. From this point 
of view, the DTC is not restricted to a single MPE XL system. 
Instead, it can provide its terminal access service to several 
MPE XL systems at once. This point of view multiplies the 
management problems because the DTC is a shared resource 
that may need to be managed by multiple MPE XL systems. 
To allow the DTC to grow, the second viewpoint was taken 
in developing the design criteria. To reduce the cost of the 
DTC, it does not have local mass storage. Instead, it depends 
on the MPE XL system to load its code and configuration 
files, and it sends memory dumps and logs errors to files 
on the MPE XL system. The on-line diagnostics for the DTC 
are controlled by the MPE XL system. All of these functions 
require exchanges of information between the DTC and 
MPE XL, in other words, a protocol. 

Fig. 7. Device control protocol read/write data reply message 
format. 

Two protocols designed by other HP networking groups 
closely met the needs of the DTC. These are the remote 
maintenance protocol (RMP) and the network management 
protocol (NMP). The DTC design team helped define these 
protocols, and the DTC is one of the first products to use 
them. Using these HP standard protocols will allow the 
DTC to fit easily into the HP network management strategy. 

The DTC uses the remote maintenance protocol to down- 
load its code and configuration files, and to upload diagnos- 
tic dumps. To begin a download sequence, the DTC sends 
a boot request packet asking for the location of its code 
file. The boot request is sent to a LAN multicast address, 
so all MPE XL systems on the LAN receive it. Each of the 
MPE XL systems looks in its configuration file to see if it 
has the code and configuration files for this particular DTC. 
If so, the MPE XL system responds with a boot reply, iden- 
tifying itself to the DTC. The DTC picks one of the respond- 
ing systems as its active loader and starts asking for code 
segments. The segment loading is carried out by a simple 
request/reply exchange of packets with a time-out retry. 
The DTC requests code segments one at a time from the 
MPE XL system. If it doesn't get a response within two 
seconds, it asks again. When it gets the reply the DTC loads 
the code and asks for the next segment. Once the load is 
done, the DTC sends a boot complete packet telling the 
MPE XL system the load is complete. A similar sequence 
is used to load the DTC-specific configuration file, except 
that the opening request is sent to the MPE XL system that 
loaded the code file, and not to a multicast address. The 
configuration file contains information such as the speed 
and parity for each of the terminal connections. Once the 
DTC has loaded its code and configuration, it comes on- 
line. 
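
The segment-loading exchange described above can be sketched in a few lines of Python. This is only an illustration of a request/reply loop with a time-out retry: the FakeHost class, the download function, the retry limit, and the idea that the segment count is known in advance are assumptions made for the sketch, not details of the remote maintenance protocol. Only the two-second time-out comes from the text.

```python
RETRY_TIMEOUT_S = 2.0  # from the text: the DTC asks again after two seconds


class FakeHost:
    """Hypothetical stand-in for an MPE XL system acting as the active loader.

    Requests whose sequence number is listed in `drop` get no reply,
    which models a time-out on the DTC side.
    """

    def __init__(self, segments, drop=()):
        self.segments = list(segments)
        self.drop = set(drop)
        self.requests = 0

    def request_segment(self, index, timeout):
        self.requests += 1
        if self.requests in self.drop:
            return None  # no reply within `timeout` seconds
        return self.segments[index]


def download(host, segment_count, max_retries=5):
    """Request code segments one at a time, repeating a request on time-out.

    max_retries is an assumption; the article does not state a retry limit.
    """
    loaded = []
    for index in range(segment_count):
        for _attempt in range(max_retries):
            reply = host.request_segment(index, RETRY_TIMEOUT_S)
            if reply is not None:
                loaded.append(reply)  # load the code, then ask for the next segment
                break
        else:
            raise TimeoutError(f"no reply for segment {index}")
    return loaded
```

In this sketch a lost reply costs one extra request for the same segment, and the load then continues with the next segment.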

The DTC uses the network management protocol to pro- 
vide on-line diagnostics from the MPE XL system. These 
functions include resetting ports, loopback testing, status 
inquiry, and error logging. The starting point for these diag- 
nostics was the current MPE Termdsm diagnostic. How- 
ever, changes were required to reflect the separation of the 
DTC from the SPU. Improvements have also been made in 
presenting the data to the user. New Termdsm commands 
were added to display the status of the DTC as a whole, 
and to display the status of a serial port. Both status com- 
mands format the data into customer understandable form. 
For example, state information is displayed in the user's 
native language (e.g., "read pending") rather than as a 
number. The goal of these two displays is to allow the 
customer to do the first level of problem diagnosis, rather 
than having to call in HP service personnel. An example 
section of a port status display is shown in Fig. 8. 



Fig. 8. A portion of a port status display. 

DTC Hardware 

The DTC is designed around a central backplane (based 
on the DIO bus) using a press-fit technology. Up to six 
connector cards and serial interface cards can be connected 
to the processor card through the backplane (see Fig. 9). 
The processor card has been leveraged from a board designed 
at the HP Colorado Networks Division. It includes 
the LAN access interface and the DTC's overall management 
functions. 

Two types of LAN accesses are offered: either Backbone 
LAN cable (IEEE 802.3 type 10 base 5) or ThinLAN cable 
(IEEE 802.3 type 10 base 2). Only one of these network 
connections can be used at a time. 

The processor card uses an 8-MHz 68000 microprocessor 
with 512K bytes of RAM and 64K bytes of EPROM. Also 
on the card are a network interface, timers, and DIO drivers. 

The serial interface card handles the processing, storage, 
and multiplexing of the data to and from eight terminals. 
It must be connected to a connector card located on the 
opposite side of the backplane. The serial interface card 
circuitry mainly consists of a Z80B processor, 16K bytes 
of EPROM, 32K bytes of shared RAM (accessible by the 
Z80B and the processor card), 8K bytes of fast RAM (accessible 
only by the Z80B), and USARTs. 

The connector card is used for physical attachment of 
the terminals. Three types of connector cards are offered 
for six RS-232-C modem connections, eight RS-232-C direct 
connections, or eight RS-422 connections with electrical 
isolation. It is possible to mix the three types of connections 
in the same DTC. 

A display located on the front panel and connected to 
the processor card shows the DTC status (download of code 
in progress, errors, self-test results, etc.). The DTC firmware 
contains a self-test, which is executed automatically at each 
power-on, and an off-line diagnostic program, which is run 
from a local terminal connected to DTC port zero. No host 
connection is needed to use the off-line diagnostic. The 
DTC hardware is fully tested by the self-test and the off-line 
diagnostic. 

After code is downloaded from the host to the DTC, 
on-line diagnostics can be started from the system console 
to cause various diagnostic and control functions to be 
executed by the DTC (see preceding section). 

The design of the DTC hardware is generic in the sense 
that it can support almost any software architecture. This 
ensures compatibility with future applications such as support 
of new peripherals or concurrent stacks of protocols. 

Fig. 9. (a) DTC hardware architecture. (b) Physical layout. 



Performance Prediction and Measurement 

The theory of operation of the DTC is based on queuing 
theory, 2 specifically the M/M/1 systems, in which customers 
of a service arrive in a system made up of a queue and 
a server. The arrivals have a Poisson pattern. The queue 
discipline is FIFO. The processing time of the server is also 
a Poisson process. There is only one server for the queue. 
The average arrival rate in the queue is L. The mean service 
time of the server is 1/M. Then the mean time spent by the 
customers in the system (queue + server) is T = 1/(M - L). 

In the DTC, each protocol communicates with the others 
using messages (carrying user data or not). But only one 
protocol executes at a time. When a protocol sends a message, 
this message has to be queued by the DTC operating 
system, AOS. When the currently executing protocol exits, 
AOS calls the next protocol, which is the destination of 
the first message of the queue. Thus the DTC can be seen 
as a queuing system made up of a queue of customers (the 
messages) and a server (the destination protocol of the first 
message of the queue). 

With this design, a packet that has to be processed by 
several protocols (IEEE 802.3, DTC FCP, DTC DCP) is 
queued several times to go across the DTC once. Some of 
the messages processed are put in the queue again, so that 
the flow out of the server partially reenters the queue, as 
shown in Fig. 10. 

Let E = 1/M be the mean execution time of the protocols. 
This is the mean service time of the queuing system. Let 
L be the incoming flow, that is, the sum of the packets 
received from the LAN or the terminals. Let N be the 
number of protocols that must be activated to receive or 
transmit one packet (N = 5). The rate of arrivals in the 
queuing system is NL, and 1/(NL) is the mean time between 
message arrivals in the queue. 

For the sake of simplicity, assume that the arrival rate 
of packets and the processing times are Poisson processes 
(although this is not always the case). To go through the 
DTC, a packet goes through an M/M/1 system several times 
(Fig. 10); the DTC is a network of M/M/1 systems. Queuing 
theory shows that in the case of a network of queues, if 
the arrival rate in one node is a Poisson process, and if this 
node partitions this stream into several streams with a given 
probability of choice, the partitioned streams are also Poisson 
processes. In the same way, if a node merges several 
streams into one stream, the resulting stream is also a Poisson 
process. For this reason, the formulas of M/M/1 systems 
are still valid (Jackson's theorem). 2 

Fig. 10. DTC queuing model. 

By applying the M/M/1 system model, the time to go 
through the system once is found to be: 

$$\frac{1}{1/E - NL}$$

If an item (packet, etc.) in the DTC undergoes a cycle made 
up of P tasks, the time to perform this cycle is: 

$$\frac{PE}{1 - NLE}$$

Obviously, NLE must be smaller than 1, since: 

$$L < \frac{1}{NE} = \text{maximum throughput}$$

We want to calculate the latency of the DTC and its 
maximum throughput. Most of the results depend on the 
traffic and/or on the number of active connections. Results 
will be given here as functions of the total traffic L, the 
sum in both directions of all of the flow rates of all of the 
active connections. 
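
As a rough numerical companion to the queuing formulas of this section, the following Python sketch evaluates the time for one pass through the system, the time for a cycle of P tasks, and the saturation limit. The function names are illustrative assumptions; the symbols E, N, L, and P are those defined in the text, with times in seconds and rates in packets per second.

```python
def time_once(E, N, L):
    """Mean time for one pass through queue plus server: 1/(1/E - N*L)."""
    if N * L * E >= 1:
        raise ValueError("N*L*E must be smaller than 1")
    return 1.0 / (1.0 / E - N * L)


def cycle_time(P, E, N, L):
    """Time for a cycle made up of P tasks: P*E/(1 - N*L*E)."""
    if N * L * E >= 1:
        raise ValueError("N*L*E must be smaller than 1")
    return P * E / (1.0 - N * L * E)


def max_throughput(E, N):
    """Largest incoming flow L that keeps N*L*E below 1: 1/(N*E)."""
    return 1.0 / (N * E)
```

With no load (L = 0) the cycle time reduces to P times E, and it grows without bound as L approaches 1/(NE), which is why that value acts as the maximum throughput.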

Acknowledgments versus Useful Packets 

At the DTC FCP level, two kinds of packets travel on the 
LAN: data packets, which carry user data and correspond 
to one I/O operation, and service packets (ACK, NAK). The 
total throughput is the sum of the data throughput and the 
number of ACK/NAKs per second. As the processing time of 
each node limits the total throughput, the data throughput 
depends on the rate of ACK/NAKs. The LAN is reliable 
enough and memory pools are dimensioned so that overruns 
occur rarely enough to neglect the NAKs and the duplicated 
packets. To estimate the data throughput we therefore 
need to know the rate of ACKs. 

The number of ACKs per transaction depends on the structure 
of a transaction, and on the window. According to a 
real-time analysis of MPE systems conducted at Hewlett- 
Packard Laboratories, we can consider that each transaction 
is made up, on the average, of 1.4 reads, 7.2 writes, and 
2.2 controls, for a total of 10.8 terminal I/Os. 90% of the 
user transactions will require 15 terminal I/Os per user 
transaction or less. This means that the DTC receives, on 
the average, 10.8 packets when it transmits 1.4 packets 
(that ratio is a result of the assumption of Poisson process- 
es). The packet transmitted by the DTC in a transaction 
implies the receipt of one ACK. The packets received by 
the DTC imply a number of ACKs, depending on the window 
size and on the value of the stand-alone acknowledgment 
timer. If this timer is correctly tuned, the number of ACKs 
transmitted per transaction will be 10.8 divided by the 
value of the window. Therefore, the traffic will have the 
following structure: 

User packets transmitted per transaction 1.4 
User packets received per transaction 10.8 
ACKs received per transaction 1 

The number of ACKs and therefore the total number of 
packets exchanged in each transaction depends on the win- 
dow size: 



Window                          1      2      3      4      5 

ACKs transmitted per 
transaction                  10.8    5.4    3.6    2.7    2.1 
Packets per transaction        24   18.6   16.8   15.9   15.3 

We can deduce the mean number of ACKs exchanged 
(received or transmitted) per user packet exchanged: 

Window                          1      2      3      4      5 

ACKs exchanged per 
packet                       0.96   0.52   0.37    0.3   0.25 
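
The arithmetic behind these two tables can be sketched as follows. The function name is an assumption. The per-transaction figures (1.4 packets transmitted, 10.8 received, one ACK received) are those given in the text, and the published table entries appear to be these results cut to one or two decimals.

```python
PACKETS_TRANSMITTED = 1.4   # user packets sent by the DTC per transaction
PACKETS_RECEIVED = 10.8     # user packets received by the DTC per transaction
ACKS_RECEIVED = 1.0         # ACKs received by the DTC per transaction


def ack_stats(window):
    """Return (ACKs transmitted, packets per transaction, ACKs per user packet)."""
    acks_transmitted = PACKETS_RECEIVED / window
    user_packets = PACKETS_TRANSMITTED + PACKETS_RECEIVED
    packets = user_packets + ACKS_RECEIVED + acks_transmitted
    acks_per_packet = (ACKS_RECEIVED + acks_transmitted) / user_packets
    return acks_transmitted, packets, acks_per_packet


for window in range(1, 6):
    acks_tx, packets, per_packet = ack_stats(window)
    print(f"window {window}: {acks_tx:5.2f} {packets:6.2f} {per_packet:5.3f}")
```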

Let U be the ratio of total traffic to data traffic. Since the 
total traffic is equal to the data traffic plus the ACKs, U has 
the following values, depending on the window: 



Window      1      2      3      4      5 

U        1.96   1.52   1.37    1.3   1.25 

Protocol Execution Time 

The execution times are different for each protocol. These 
times were calculated and later confirmed by actual measurements 
on the DTC. The mean value of E for data packets 
is E = 1.25 ms, and for ACK packets is E = 0.65 ms. 

These values do not depend on the packet length. This 
is not exactly true, because the DTC DCP copies data to 
and from the serial interface card. However, since packets 
are assumed to be shorter than 80 bytes most of the time, 
this copy takes less than about 16% of the packet processing 
time and is not accounted for in this analysis. 

Maximum Throughput 

The maximum throughput is 1/NE, where N is the 
number of protocol activations for reception or transmission 
of one packet. NE is in fact the total processing time 
of a packet. N = 5 for receipt or transmission of user data, 
and N = 3 for ACKs. The mean rate of ACKs per user data 
packet is given above as a function of the window size. 
Using this rate as a weighting factor on the processing 
times, the table below gives the mean value of NE. T is the 
data throughput. From the mean value of NE, the maximum 
total throughput (user packets + ACKs) and the maximum 
data throughput (data packets only) are deduced. 

Window                        1      2      3      4      5 

ACKs per data packet       0.96   0.52   0.37    0.3   0.25 
Mean NE (ms)                4.1    4.7    5.1    5.3    5.4 
Maximum total 
throughput (L)              243    212    196    190    185 
Maximum data 
throughput (T)              124    140    143    146    148 

Throughputs are expressed in packets per second, so if 
the window is 3, the maximum total throughput is 196 
packets per second, and the maximum data throughput is 
143 packets per second. 

Among 48 terminals, approximately 24 will be active (1 
I/O per hour), and among these 24 terminals only 12 are 
doing real work (8 transactions per minute). Therefore the 
mean data throughput (or nominal traffic) needed is about 
19.5 packets/s and the mean total throughput is 26.7 packets/s. 
Larger flows may occur for printers. Packet concatenation 
decreases the number of packets per second exchanged, 
but increases the length of each packet, so that 
the processing time of one packet may become dependent 
on the length of this packet. 

An estimation of the peak can be obtained by assuming 
the following conditions: each of 36 terminals does 15 
transactions per minute (think time plus response time = 
4 seconds) and 15 terminal I/Os are done per transaction. 
The data traffic is then 135 packets/s. In other words, the 
DTC supports up to 6 times the expected nominal traffic. 
The maximum throughput of a serial interface card is about 
6x17.2 kilobaud outbound and 8x10 kilobaud inbound. 


Latency of the DTC 

The latency of the DTC is the sum of the latencies of the 
serial interface card and the CPU card. The latency of the 
serial interface card is about 1 ms. The latency D of the 
CPU card is the time for processing P protocols. As 
explained earlier, the time to perform P tasks is D = PE/(1 
- NLE). We have seen above that U = L/T depends on 
the window. The latency of the DTC is: 

$$\text{Latency} = D + 1\ \text{ms} = \frac{PE}{1 - NEUT} + 1\ \text{ms}$$

The theoretical curve in Fig. 11 shows the latency when 
the window is 3, E = 1.25 ms, NE = 5.1 ms, P = 4, U = 
1.37, and T varies from 0 to 1/NEU. If T is equal to the 
mean throughput expected (20 packets/s), the latency is 
about 8 ms. 

There is a saturation effect. The latency cannot become 
infinite. When the memory pools are empty the incoming 
flows stop until resources are released. In such a situation, 
each packet takes about four seconds to process. This can 
be considered the limit, and it implies a DTC contribution 
of about 40 seconds in each transaction. 
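
The throughput and latency calculations can be tried out with a short Python sketch. It is a reading of the model rather than the authors' own program. The function names are assumptions, and the weighting used for the mean NE (a per-packet average over data packets and ACKs) is an assumed interpretation that reproduces the tabulated values to within rounding. The constants are the ones quoted in the text.

```python
N_DATA, E_DATA = 5, 1.25e-3   # activations and mean time per activation, data packets
N_ACK, E_ACK = 3, 0.65e-3     # same quantities for ACK packets
SERIAL_CARD_LATENCY = 1e-3    # seconds


def mean_ne(acks_per_data_packet):
    """Mean processing time per packet, averaged over data packets and ACKs."""
    a = acks_per_data_packet
    return (N_DATA * E_DATA + a * N_ACK * E_ACK) / (1 + a)


def max_throughputs(acks_per_data_packet):
    """Return (maximum total, maximum data) throughput in packets/s."""
    total = 1.0 / mean_ne(acks_per_data_packet)
    # U = 1 + ACKs per data packet is the ratio of total to data traffic.
    return total, total / (1 + acks_per_data_packet)


def latency(T, P=4, E=E_DATA, NE=5.1e-3, U=1.37):
    """Latency = P*E/(1 - NE*U*T) + 1 ms, for data throughput T in packets/s."""
    load = NE * U * T
    if load >= 1:
        raise ValueError("T is at or beyond the maximum data throughput")
    return P * E / (1 - load) + SERIAL_CARD_LATENCY
```

For a window of 3 (0.37 ACKs per data packet) the sketch gives roughly 196 and 143 packets/s. The latency it returns at 20 packets/s is a few milliseconds, the same order as the value read from the theoretical curve, and it rises steeply as T approaches 1/NEU.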

Fig. 11. DTC latency as a function of throughput. 



Other Results 

This model has been used to calculate other parameters 
of the DTC, including the minimum window that does not 
decrease the maximum throughput, the influence of over- 
runs on the traffic (creating NAKs and duplicated packets), 
and the size of the memory pools needed to keep a low 
level of overruns for received packets without decreasing 
the maximum throughput for data transmission and ACK 
transmission. Actual measurements with special equip- 
ment have shown a latency of 9 ms for low use, and a 
maximum throughput of 120 to 150 packets/s (see Fig. 11). 


Hewlett-Packard Precision Architecture 
Compiler Performance 



Using a combination of simple instructions, optimized in-line 
code, and highly specialized Millicode routines, HP 
Precision Architecture machines perform many complex 
operations faster than CISC machines. 

by Karl W. Pettis and William B. Buzbee 



THE NEW HEWLETT-PACKARD Precision Architec- 
ture 1 is designed to provide a low-level interface to 
hardware through a small, fast instruction set. With 
any new architecture, compilers must be developed to pro- 
vide high-level language interfaces to the machine. This 
importance of compilers to the new architecture was recog- 
nized from the project's inception, and software engineers, 
including compiler experts, were heavily involved in spec- 
ifying the architecture. 

This paper describes the influence performance criteria 
had on the implementation of the new compilers and how 
various problems were overcome. First, the influence that 
the high-level languages had on the design of the instruc- 
tion set is described. Specific examples of instructions are 
given that enable the compilers to implement some high- 
level constructs efficiently, and to avoid problems that 
some see as inevitable with a reduced instruction set com- 
puter (RISC). Next, the problem of doing truly complex 
operations is described. These operations are sometimes 
implemented as instructions on traditional machines. In- 
stead, it was decided to implement a streamlined procedure 
calling convention and a group of routines known as Mil- 
licode to solve such problems. Finally, the results for spe- 
cific examples are presented. 

High-Level Language Influence 

As described in a previous paper, 2 a team of engineers 
specializing in various areas of computers designed the 
new architecture. The design was based on studies that 
showed what instructions computers actually spend time 
executing. These studies showed that even computers with 
very large, complex instruction sets typically spent an overwhelming 
portion of their time executing very simple instructions, 
such as memory loads and stores, branching, 
address calculation, and addition. So the initial design of 
the instruction set provided for the efficient execution of 
such frequently used instructions. 

A key factor in the design was to make all instructions 
except branches and loads from memory execute in a single 
machine cycle. In addition, the cycle time itself was to be 
very short. Proposed instructions that could not meet these 
stringent criteria without adding considerably to the com- 
plexity of the processor were either eliminated or modified. 
Nevertheless, the software engineers on the design team 
made sure that the resulting instruction set was sufficiently 



rich to allow efficient implementation of most high-level 
language constructs. Sometimes this was done by adding 
helper instructions that are simple enough to be executed 
in a single cycle to make implementation of more complex 
operations easy. An example is the DIVIDE STEP instruction, 
which, when combined with an ADDC (add with carry) in- 
struction, computes a single bit of the quotient of a division. 

Very early in the project, a prototype C compiler and a 
processor simulator were developed that enabled the en- 
gineers to make sure that high-level constructs could be 
executed easily using the new instruction set. As the in- 
struction set was implemented and evaluated, changes 
were suggested and accepted for the definitions of test con- 
ditions for various instructions, the way that nullification 
works for conditional branches, and other aspects of the 
instruction set. 

It is worthwhile to look at some of the instructions that 
help the compilers produce efficient code. 
SHIFT AND ADD. The SHnADD family of instructions includes 
SH1ADD, SH2ADD, and SH3ADD (along with counterparts that 
do not set the carry/borrow bits and others that trap on 
overflow). These instructions shift the contents of the first 
register argument left by n bits and then add the contents 
of the second register argument, putting the result into the 
third register argument. The result is that multiplications 
by small constants can be performed in a few instructions. 



Instruction            Result 

SH1ADD r1,r2,r3        r3 ← 2*r1 + r2 
SH2ADD r1,r2,r3        r3 ← 4*r1 + r2 
SH3ADD r1,r2,r3        r3 ← 8*r1 + r2 



For example, to multiply a value in register 8 by the constant 
41 with the result being put into register 9, the following 
code would be produced: 

SH2ADD 8,8,9        ; r9 ← 5*r8 
SH3ADD 9,8,9        ; r9 ← 8*(5*r8) + r8 = 41*r8 



These instructions ensure that multiplication by a small 
constant, which is done fairly often, is very inexpensive. 
These instructions are also used in the nonconstant case. 
Even when both operands of a multiplication are not constants, 
often one of the operands is fairly small. When this 
is true it usually takes about 20 cycles to multiply two 
variable quantities. This fast variable multiply is performed 
by extracting four-bit "digits" from the smaller operand 
and using them to select the appropriate shift and add 
sequence to perform the multiplication. 
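
The shift-and-add technique can be modeled in a few lines of Python. This is a sketch of the arithmetic only, not of the Millicode routine: the function names are assumptions, values are treated as unsigned 32-bit quantities, and the product of a four-bit digit with the larger operand stands in for the selected shift-and-add sequence.

```python
MASK32 = 0xFFFFFFFF


def shnadd(n, r1, r2):
    """Model of SHnADD: shift r1 left by n bits and add r2 (32-bit result)."""
    return ((r1 << n) + r2) & MASK32


def times_41(x):
    """Multiply by 41 with the two-instruction sequence shown above."""
    t = shnadd(2, x, x)        # t = 4*x + x = 5*x
    return shnadd(3, t, x)     # 8*(5*x) + x = 41*x


def multiply_by_digits(a, b):
    """Multiply two unsigned values using four-bit digits of the smaller one."""
    small, large = (a, b) if a <= b else (b, a)
    result = 0
    shift = 0
    while small:
        digit = small & 0xF
        # digit * large stands in for the shift-and-add sequence chosen by the digit
        result = (result + ((digit * large) << shift)) & MASK32
        small >>= 4
        shift += 4
    return result
```

The loop runs once per four-bit digit of the smaller operand, which is why a small operand keeps the cycle count low.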
EXTRACT AND DEPOSIT. A family of instructions is provided 
to access fields within records. These instructions either 
extract the value from a field or deposit a value into a field 
and are used by the compilers when accessing packed or 
nonaligned data. In some other computers, one has to shift 
and mask to retrieve values in packed records. As an exam- 
ple, here is a sample of some code from a C pro- 
gram along with the instructions generated. The C code is: 

struct descriptor { 
    unsigned int available   : 1; 
    unsigned int locked      : 1; 
    unsigned int last_access : 10; 
    unsigned int desc_type   : 5; 
    unsigned int length      : 15; 
}; 

copy_type(p, q) struct descriptor *p, *q; 
{ 
    p->desc_type = q->desc_type; 
} 

The HP Precision Architecture instructions to do the 
copy from one field to another are: 

                     ; The word containing q->desc_type 
                     ; is in register 29. The word containing 
                     ; p->desc_type is in register 25. 

EXTRU 29,16,5,31     ; Extract the five bits of register 29 
                     ; corresponding to q->desc_type. 
                     ; This is an unsigned field, so use 
                     ; EXTRU. 

DEP 31,16,5,25       ; Put those five bits into the word 
                     ; containing p->desc_type. 

Being able to use just two instructions to do a transfer 
of packed data makes it easier for the compilers to generate 
efficient code when programmers use packed structures. 
However, these instructions are sufficiently generalized 
that the compilers can use them to perform left and right 
shifts (logical and arithmetic), and they can be used in 
contexts that have nothing to do with fields within a record. 
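
A small Python model may help in reading the two-instruction field copy above. It assumes that bits are numbered from 0 at the most significant end of a 32-bit word and that the position operand names the rightmost bit of the field, which is consistent with desc_type occupying bits 12 through 16 of the descriptor. The function names are illustrative.

```python
WORD_BITS = 32
MASK32 = 0xFFFFFFFF


def extru(r, pos, length):
    """Extract an unsigned field of `length` bits whose rightmost bit is `pos`."""
    shift = WORD_BITS - 1 - pos
    return (r >> shift) & ((1 << length) - 1)


def dep(r, pos, length, target):
    """Deposit the low `length` bits of r into target, ending at bit `pos`."""
    shift = WORD_BITS - 1 - pos
    mask = ((1 << length) - 1) << shift
    return (target & ~mask & MASK32) | ((r << shift) & mask)


def copy_type(p_word, q_word):
    """Copy the desc_type field (bits 12..16) from q_word into p_word."""
    return dep(extru(q_word, 16, 5), 16, 5, p_word)
```

Only the five field bits of the destination word change, and the neighboring fields are left as they were.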
Conditional Branches. On many computers conditional 
branches are implemented by performing some computa- 
tion (typically a subtraction) that has the side effect of 
setting some hardware condition flags. The next instruction 
conditionally branches based on the settings of these flags. 
The studies done by the HP Precision Architecture en- 
gineers indicated that conditional branches are done very 
frequently by programs. The decision was therefore made 
to combine testing of various conditions with branching 
so that a conditional branch only requires one instruction. 

Many of the conditional branches can also be used to 
modify register contents. Thus, the ADDIBT and ADDIBF in- 
structions (add immediate and branch if true or false) can 
be used by the compilers to update a loop counter while 
simultaneously testing to see if the loop is completed. Al- 
ternatively, by using the TR condition (TRue = always 
branch), the compilers can sometimes merge an uncondi- 



tional branch with a nonbranch instruction to make pro- 
grams shorter and faster. Branches are implemented in this 
architecture so that the target of the branch is not executed 
until two cycles later. On conventional systems, the cycle 
following the branch is wasted. With HP Precision Ar- 
chitecture, the instruction immediately following the 
branch instruction is executed in the cycle before the 
branch takes effect. This enables the compilers to use this 
instruction slot for useful work. The branch instruction 
can optionally nullify this following instruction so that it 
will have no effect. If a compiler cannot always use the 
cycle after a branch, it will turn on nullification. If nullifi- 
cation is specified for an unconditional branch, the follow- 
ing instruction will always be nullified after the branch 
instruction. If nullification is specified for a conditional 
branch, then the following instruction will be nullified if 
the branch is taken or if the branch is backward, but not 
both. This nullification scheme is designed so that if a 
conditional branch is being used in controlling a loop, an 
instruction from the body of the loop can always be exe- 
cuted in the otherwise wasted cycle after the conditional 
branch. The effect is that almost all loops can be made one 
instruction shorter and faster than with a more conven- 
tional machine. Since most computer programs spend most 
of their time in very small loops, even one instruction saved 
per loop can be significant. 
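
The nullification rule for the instruction following a branch can be summarized as a small truth function. The Python below is a sketch of the rule as stated in the preceding paragraph, and the function and parameter names are assumptions.

```python
def following_instruction_nullified(nullify, conditional, taken, backward):
    """Return True if the instruction after the branch has no effect.

    nullify     -- nullification was specified on the branch
    conditional -- the branch is conditional
    taken       -- the branch is taken (conditional branches only)
    backward    -- the branch target is at a lower address
    """
    if not nullify:
        return False
    if not conditional:
        return True
    # Nullified if the branch is taken or backward, but not both.
    return taken != backward
```

For a backward loop-closing branch with nullification on, the following instruction executes whenever the loop repeats and is skipped when the loop exits, so an instruction from the loop body can safely be placed there.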

Decimal Correction. There are two instructions, DCOR (deci- 
mal correct) and IDCOR (intermediate decimal correct), that 
provide invaluable assistance for decimal arithmetic. One 
of the criticisms often leveled at typical RISC architectures 
is that they provide insufficient support for decimal oper- 
ations, which are crucial to commercial languages such as 
COBOL and RPG. Using a biasing scheme and the DCOR 
and IDCOR instructions, compilers for HP Precision Ar- 
chitecture systems can implement decimal additions and 
subtractions with the ordinary binary ADD and SUB instruc- 
tions and minimal extra instructions. Other decimal oper- 
ations are more complex and are described later in this 
paper. 
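
To give a feel for how a biasing scheme lets a binary adder do decimal work, here is a Python sketch of the classic excess-6 technique for packed decimal digits. It is offered as an assumed reading of the approach, not as a definition of DCOR or IDCOR, whose exact behavior is not given here. One operand is biased by 6 in every digit, an ordinary binary addition is done, and a correction step removes the bias from every digit that did not produce a carry.

```python
def bcd_add(a, b, digits=8):
    """Add two packed-decimal values (one decimal digit per four bits)."""
    bias = int("6" * digits, 16)
    biased = a + bias                 # every digit of a is now in excess-6 form
    total = biased + b                # ordinary binary addition
    carry_into = biased ^ b ^ total   # bit i is the carry into bit position i
    result = total
    for k in range(digits):
        carry_out = (carry_into >> (4 * (k + 1))) & 1
        if not carry_out:
            result -= 6 << (4 * k)    # correction: remove the unused bias
    return result & ((1 << (4 * digits)) - 1)
```

A digit that produced a carry has already wrapped past 9 correctly because of the bias, so only the digits without a carry need the correction.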

STORE BYTES. The STBYS instruction stores zero to four 
bytes to memory depending on the offset specified in the 
instruction and the alignment of the base address. It is a 
great aid in moving data from one location to another, 
particularly when a filling operation is being performed. 
This often happens when using strings in Pascal and For- 
tran and with the MOVE directive in COBOL. This instruc- 
tion also lets the compiler writer generate simpler code to 
store data that is inconveniently aligned with respect to 
word boundaries. 

Nullification. Most arithmetic and logical instructions 
allow conditional nullification of the following instruction. 
In effect, this is a way for the compilers to generate short 
if-then sequences in-line without incurring the overhead 
of actually doing a test and branching. The compiler most 
often uses this feature in short "canned" code sequences 
to compute a result. For example, suppose a Boolean flag 
in Pascal is to contain a value indicating whether a is less 
than b or not. The Pascal code is: 

flag := a < b; 

The HP Precision instructions to perform this statement 
are: 

COMCLR,>= r1,r2,r3    ; a = r1, b = r2, flag = r3 
LDI 1,r3 

The COMCLR (COMPARE AND CLEAR) instruction sets the 
value of r3 (flag) to 0. If the value of r1 (a) is greater than 
or equal to the value of r2 (b), then the following instruction 
is nullified, and the value of flag remains 0 (false). Otherwise, 
the LDI (LOAD IMMEDIATE) instruction sets the value 
of flag to 1 (true). 

The ability to nullify the following instruction condition- 
ally is very powerful and can be used in many contexts. 
For example, to perform a signed integer division by a 
power of two (such as 8) with truncation towards 0, only 
three instructions are needed: 

OR,>= r1,0,r2         ; Put a copy of the source 
                      ; (r1) into the destination 
                      ; (r2). Also test to see if the 
                      ; source is positive or 0. If 
                      ; so, then the next instruction 
                      ; is nullified. 

ADDI 7,r2,r2          ; Executed only if the source 
                      ; was negative. If it was, 
                      ; this ensures truncation 
                      ; towards 0. 

EXTRS r2,28,29,r2     ; Arithmetic right shift the 
                      ; (possibly modified) 
                      ; destination by three bits 
                      ; yielding the final result. 

An arithmetic right shift by three bits (implemented by 
the EXTRS instruction) divides a quantity by eight, with 
truncation towards negative infinity. For positive numbers, 
this is the same as truncation towards 0. For negative num- 
bers, an adjustment needs to be made. Here we have used 
nullification so that the adjustment is done only if the 
dividend is negative. If nullification were not available, 
we would have had to generate code using branches. 
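
The effect of the three-instruction sequence can be checked with a short Python model. The function name is an assumption. Python's right shift on integers is an arithmetic shift, so it truncates towards negative infinity just as the EXTRS step does.

```python
def div8_trunc(x):
    """Signed division by 8 with truncation towards 0."""
    r2 = x | 0          # OR r1,0,r2 copies the source into the destination
    if not (x >= 0):    # the ADDI is nullified when the >= condition holds
        r2 = r2 + 7     # bias negative dividends by 2**3 - 1
    return r2 >> 3      # arithmetic right shift by three bits
```

Adding 7 before the shift moves every negative dividend that is not a multiple of 8 up past the next multiple, so the floor taken by the shift lands on the value truncated towards 0.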

In short, the instruction set provided by the new HP 
Precision Architecture is very powerful, allowing short in- 
struction sequences to be generated for many high-level 
operations. Having a simplified instruction set where every 
instruction executes in a single cycle frees the compilers 
from having to make complicated analyses of varying in- 
struction sequences to see which is better. With HP Preci- 
sion Architecture, shorter is better. 

Procedure Calling Convention 

When the first compiling systems were initially brought 
up on the new architecture, a traditional procedure calling 
convention was investigated. This had a frame pointer and 
a top-of-stack pointer, and parameters were passed to 
routines by pushing them onto the stack. Upon return, the 
parameters were popped. However, it was discovered that 
procedure calls could be performed even more efficiently 
with a new procedure calling convention. 

Under the new convention, the registers are now divided 
into three classes: the caller-saves, the callee-saves, and 



the linkage registers. The called routine is free to modify 
the caller-save registers with no overhead, but must save 
and restore any registers it uses in the callee-save set. If a 
routine is simple and does not use many registers, the 
overhead in making a procedure call is often only a few 
instructions. There is no pointer to the previous frame, nor 
are parameters pushed and popped as on conventional 
machines. The first four parameters are passed in registers. 
If there are more parameters, these are placed into an area 
on the stack allocated once at the beginning of the calling 
procedure. 

In addition, a special class of routines has been de- 
veloped with an even more streamlined calling convention. 
These routines are called Millicode and they are designed
to implement more complex operations that are done fre- 
quently by programs. Some of these routines, like the mul- 
tiplication routine, correspond to machine instructions im- 
plemented by microcode on other machines. In general, 
these routines are only allowed to modify a very small 
number of registers (typically 4 to 6). The compilers know 
which few registers can be modified by each Millicode call 
and can arrange to have those registers free, while using 
other registers (including some caller-save registers) to 
store temporary intermediate results. Since these routines 
use so few registers, the compilers can almost always gen- 
erate code that calls them without having to do any extra 
saving and restoring of registers. Also, the linkage registers 
required to call and return from Millicode routines are 
different than for normal routines. This enables routines 
that only call Millicode and not other normal routines to 
have even less overhead, since they are not required to 
store their linkage information. 

Complex Operations 

One of the most enduring criticisms of RISC architectures 
is that they are unable to handle complex operations effi- 
ciently. Critics often suggest that applications that rely 
upon a high percentage of complex operations will execute 
slowly, suffer from excessive code size, or both. In part 
because of the RISC extensions provided with HP Precision 
Architecture, we have experienced the opposite. HP Preci- 
sion Architecture machines running the most popular 
COBOL processor benchmarks outperform their CISC counterparts
by a factor of 1.3 to 4.2, above and beyond differences
in the machines' respective MIPS rates (see page 35).
Furthermore, this performance is achieved using 15% to 
30% fewer in-line instructions. 

The reason for this success lies in the elegant partnership 
of custom in-line code sequences and Millicode. Together 
they form a solid foundation for the somewhat paradoxical
assertion that complex operations can benefit more from
architectures that concentrate on fast simple operations 
than from those that concentrate on fast complex opera- 
tions. 

To understand the HP Precision Architecture complex 
operation solution, we must first understand the problems 
of generating code to perform a complex operation. For the 
purpose of this discussion, a complex operation is a task 
that requires three or more of a machine's most basic in- 
structions to complete. Some obvious and very important 
examples of complex operations are byte moves, string 
comparisons, and decimal arithmetic.

In traditional complex instruction set computers (CISC), 
every machine instruction is performed by the execution 
of a corresponding microcode program. The microcode pro- 
grams are made up of microinstructions — the instructions 
that the actual hardware executes. In a sense, CISC systems 
are really computers inside of computers. HP Precision 
Architecture has eliminated the outer simulated computer, 
providing for the direct execution of its machine instruc- 
tions. From this point of view, HP Precision Architecture 
machine instructions can be thought of as roughly equiva- 
lent to CISC microinstructions. The reason for elimination 
of the outer computer is simple. There is a fixed overhead 
associated with the simulation of the CISC outer computer 
on the CISC inner computer. For simple instructions this 
interpretive overhead can account for more than the amount 
of time actually spent performing the operation. HP Preci- 
sion Architecture has streamlined simple operations by 
eliminating this overhead. 

For complex operations, however, the story changes. The 
CISC overhead is relatively fixed. Complex operations often 
require the execution of dozens or hundreds of microinstructions.
Thus, the overhead becomes insignificant. Similarly,
the advantage HP Precision Architecture gains by
eliminating overhead also becomes insignificant for com- 
plex operations. Given only this information, it would seem 
that the critics are correct, and that RISC-like architectures 
face a serious complex operation performance problem 
when competing head-to-head with CISC systems. 

A Closer Look 

The conclusion that RISC-like systems are inherently 
slower than CISC systems for complex operations rests on 
the assumption that RISC-to-CISC MIPS ratios are invalid. 
As the argument goes, it is assumed that more RISC instruc- 
tions are required to perform a complex operation than 
CISC instructions, and therefore, a RISC MIPS is worth less 
than a CISC MIPS. This assumption, however, is not neces- 
sarily correct. 

Early in the development of HP Precision Architecture 
complex operation code generation, selected CISC micro- 
code programs were carefully examined. One important 
characteristic quickly stood out: microcode programs for 
many types of complex operations spend a large amount 
of time calculating and interpreting information at run time 
that compilers knew at compile time. Under a CISC system, 
this information is lost because a compiler cannot transmit 
it through the instruction set to microcode. Because micro- 
code program space is scarce and expensive, there is typ- 
ically only a single CISC instruction per class of complex 
operations. For example, a CISC system may have one string 
comparison instruction, one decimal addition instruction, 
etc. The microcode programs associated with these instruc- 
tions must therefore be capable of handling all situations 
and must typically make worst-case assumptions. A com- 
piler may know, for example, that the source and target of 
a byte move do not overlap, but in a CISC system the mi- 
crocode would have to determine this at run time. 

The obvious solution for the speed problem is to exploit 
all available compile-time information to eliminate the interpretive
overhead suffered by CISC microprograms. The



compilers, in effect, become custom microcode program 
generators. Code generation for simple operations takes 
place as usual, but when a complex operation is called for, 
the compiler analyzes all available compile-time informa- 
tion and generates a custom, highly specific "microcode" 
program that it drops in-line. Using this scheme, an HP 
Precision Architecture system could perform the operation 
in fewer cycles. There is, however, a serious problem with
this approach — code expansion. Genuinely optimal code 
sequences for complex operations could require dozens or 
hundreds of HP Precision Architecture instructions. In 
some situations, code size could become enormous, 
perhaps to the point where any speed improvements would 
be lost in increased cache and page traffic. 

If code size were the only concern, then the opposite 
approach is to pretend that code is being generated for a 
CISC machine. In other words, develop a run-time library 
or, ultimately, an interpreter for a compact pseudocode, 
which contains one procedure for every CISC complex in- 
struction microprogram. Whenever a complex operation 
needs to be performed, the compiler generates a call to the 
appropriate routine within the library. This scheme solves 
the code expansion problem by reducing the in-line cost 
of a complex operation to that of a procedure call, a cost 
roughly comparable to the fetching and decoding cost of a 
CISC instruction. Furthermore, assuming that the run-time 
library is shared among all processes on a system, this 
scheme eliminates stress on the memory hierarchy. This 
code size solution, however, sends us back to square one. 
Such a run-time library suffers the same interpretive over- 
head as CISC microcode programs. 

HP Precision Architecture Solution 

The ideal solution would be to combine the advantages 
of the two approaches without incurring their disadvan- 
tages. The HP Precision Architecture solution comes very 
close to this ideal. In brief, the HP Precision Architecture 
compilers examine every complex operation and break it 
down into steps. The steps in which compile-time informa- 
tion is either unavailable or useless are performed by a call 
to a routine within a special shared library, keeping the 
code size compact. For the remaining steps the compilers 
determine whether the step is interpretive or repetitive. A 
repetitive step is one that can be performed in a library 
routine with little or no interpretive overhead, while an 
interpretive step is one that would suffer from interpretive 
overhead if performed in a library routine. Interpretive 
steps are performed using custom in-line code sequences. 
Repetitive steps are performed by calls to highly specific 
library routines. A more precise description of the HP Pre- 
cision Architecture complex code generation solution is 
given in the pseudocode algorithm of Fig. 1. 

The HP Precision Architecture solution is further en- 
hanced by the nature of its special shared library routines, 
which are known as Millicode. In its simplest form, Millicode
is nothing more than a series of instructions packaged
to look somewhat like a procedure. A Millicode routine is 
invoked by a simplified calling mechanism in which almost 
no state saving is required. Millicode is designed so that a 
single copy can exist on a system, allowing all processes 
to share it. 
IF [ no useful compile-time information about the operation is available ] THEN
    IF [ the number of instructions required to perform the operation ]
       LESSTHAN [ speed space factor ] THEN
        [ Perform the operation entirely in-line ]
    ELSE
        [ Generate call to general-purpose Millicode routine ]
ELSE
    IF [ the operation is fully described by compile-time information ] THEN
        IF [ the number of instructions required to perform the operation ]
           LESSTHAN [ speed space factor ] THEN
            [ Perform the operation in-line using optimum custom code sequence ]
        ELSE
            BEGIN
            [ Separate operation into repetitive and interpretive steps ]
            FOR [ each step ] DO
                IF [ step is interpretive ] THEN
                    [ Perform the step in-line using optimum custom code sequence ]
                ELSE
                    [ Generate call to one of a set of highly specific repetitive
                      Millicode routines ]
            END
    ELSE
        BEGIN
        [ Separate operation into steps in which useful compile-time
          information is available and steps in which useful compile-time
          information is not available ]
        FOR [ each step ] DO
            [ Consider the step a separate complex operation and
              reapply this algorithm ]
        END



soff = Constant offset to start of source string
toff = Constant offset to start of target
dp   = Register containing base pointer

        LDW    soff+0(dp),reg1    ; Pick up first four bytes
        LDW    soff+4(dp),reg2    ; Pick up next four bytes
        STW    reg1,toff+0(dp)    ; Store first four bytes
        STW    reg2,toff+4(dp)    ; Store next four bytes

Fig. 2. Eight-byte move portion of a move/fill operation.

compile-time information, the compiler decides that it 
would require too many in-line instructions. Therefore, the 
fill is broken down into two steps: full-word fills and par- 
tial-word fills. A 22-byte fill is really a five-word fill fol- 
lowed by a two-byte fill. The five-word fill step is a repeti- 
tive step, and is performed in Millicode. The two-byte step 
would require interpretation at run-time if performed in 
Millicode, so it is performed in-line. The generated code 
sequence is shown in Fig. 3. 

In the above example, only four instructions were de- 
voted to overhead: loading the address of the beginning of 
the fill, loading the fill character, branching to Millicode,
and returning from Millicode. 
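
The split of the fill into a repetitive part and an interpretive part is simple arithmetic, and can be sketched in Python. This is only an illustration under stated assumptions: a four-byte word, a word-aligned starting address, and a hypothetical helper name `plan_fill` that does not come from any HP compiler.

```python
WORD_BYTES = 4  # assumed word size; the figures move data with 4-byte and 2-byte stores


def plan_fill(length_bytes):
    """Split a fill of known length into full words and leftover bytes.

    The full-word count is the repetitive step that a Millicode entry
    point could perform; the leftover bytes are the step that would
    otherwise need run-time interpretation, so they are stored in-line.
    """
    if length_bytes < 0:
        raise ValueError("length must be non-negative")
    return divmod(length_bytes, WORD_BYTES)


full_words, leftover = plan_fill(22)
print(full_words, leftover)  # 5 2
```

For the 22-byte fill in the text this gives five full words and two leftover bytes, matching the five-word Millicode step and the in-line two-byte step.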

One extremely important side benefit of the HP Precision 
Architecture solution is that as more compile-time informa- 
tion is retained and processed, it becomes easier to recog- 
nize special cases and take advantage of them. For example, 
the incrementing of the COBOL unpacked decimal (ASCII 
display) data type is typically performed in CISC systems 



Fig. 1. HP Precision Architecture algorithm for complex operation
code generation.
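
As a rough illustration, the decision structure of Fig. 1 can be written as a small recursive Python function. Everything named here is hypothetical: the `Operation` and `Step` classes, the string labels, and the threshold value `SPEED_SPACE_FACTOR = 6` are invented for the sketch, since the article gives neither data structures nor a numeric threshold.

```python
from dataclasses import dataclass, field
from typing import List

SPEED_SPACE_FACTOR = 6  # hypothetical threshold; the article does not give a value


@dataclass
class Step:
    interpretive: bool  # True if a library routine would need run-time interpretation


@dataclass
class Operation:
    info: str                   # "none", "full", or "partial" compile-time information
    instruction_count: int = 0  # in-line instructions needed for the whole operation
    steps: List[Step] = field(default_factory=list)                  # used when info == "full"
    sub_operations: List["Operation"] = field(default_factory=list)  # used when info == "partial"


def generate(op):
    """Return the list of code-generation decisions for one complex operation."""
    if op.info == "none":
        if op.instruction_count < SPEED_SPACE_FACTOR:
            return ["in-line"]
        return ["call general-purpose Millicode"]
    if op.info == "full":
        if op.instruction_count < SPEED_SPACE_FACTOR:
            return ["in-line custom sequence"]
        # Too long to expand fully: decide step by step.
        return [
            "in-line custom sequence" if step.interpretive
            else "call specific repetitive Millicode"
            for step in op.steps
        ]
    # Partial information: treat each part as its own operation and recurse.
    decisions = []
    for sub in op.sub_operations:
        decisions.extend(generate(sub))
    return decisions


# Instruction counts below are invented for the example.
move = Operation("full", instruction_count=4)
fill = Operation("full", instruction_count=7, steps=[Step(False), Step(True)])
print(generate(move))
print(generate(fill))
```

With these made-up counts, the move is expanded entirely in-line, while the fill is split into a Millicode call for its repetitive step and an in-line sequence for its interpretive step.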



Although Millicode serves many of the functions of a 
CISC machine's microcode, there is an important differ- 
ence. Millicode doesn't suffer from the severe space restric- 
tions placed upon microcode. This is particularly impor- 
tant for the class of complex operations in which useful 
information is available at compile time, but whose custom 
in-line code sequences would be too lengthy. Instead of 
wasting the compile-time information, a series of Millicode 
routines can be created, each tailored to a particular com- 
bination of compile-time information. Unlike a CISC sys- 
tem's microcode program, these special-purpose Millicode 
routines can make best-case assumptions. The compiler deter- 
mines the best one to call based on the available information. 

A good example of the HP Precision Architecture com- 
plex operation code generation process is an eight-byte 
move followed by a 22-byte blank fill in which all align- 
ment and length information is known at compile time. 
This is a reasonably common type of operation when deal- 
ing with character strings. Following the algorithm of Fig. 
1, the compiler first examines the move operation and de- 
termines that it is fully described by compile-time informa- 
tion. Then it finds that the code size to perform the move 
is small enough, and performs the move entirely in-line 
with no interpretive overhead (see Fig. 2). Next, the fill 
portion is examined. Although it also is fully described by 



soff   = Constant offset to start of source string
toff   = Constant offset to start of target
dp     = Register containing base pointer
blanks = Register containing 0x20202020 (all blanks)
MRP    = Millicode return pointer
sr     = Millicode space register

{ In-Line Code for Move }
        LDW        soff+0(dp),reg1
        LDW        soff+4(dp),reg2
        STW        reg1,toff+0(dp)
        STW        reg2,toff+4(dp)

{ In-Line Code for Fill }
        LDO        toff+8(dp),ARG1
        BLE        $$fill_5(sr,0)
        COPY       blanks,ARG0
        STH        ARG0,0(ARG1)

Fill Millicode { entry points continue to $$fill_31 }
        ...
        STBYS,b,m  ARG0,4(ARG1)
$$fill_6
        STBYS,b,m  ARG0,4(ARG1)
$$fill_5
        STBYS,b,m  ARG0,4(ARG1)
$$fill_4
        STBYS,b,m  ARG0,4(ARG1)
$$fill_3
        STBYS,b,m  ARG0,4(ARG1)
$$fill_2
        STBYS,b,m  ARG0,4(ARG1)
$$fill_1
        BE         0(0,MRP)
        STBYS,b,m  ARG0,4(ARG1)

Fig. 3. Complete move/fill code sequence.
by converting to packed decimal, doing the normal decimal
addition, and then converting back to unpacked decimal. 
On HP Precision systems, the compilers recognize that in 
the typical case no carry will be generated and only the 
least-significant digit of the number will have to be altered. 
It then becomes possible to generate extremely fast code 
sequences to perform the operation (see Fig. 4). In one 
COBOL processor benchmark, an HP Precision Architecture 
machine outperformed its CISC counterpart at unpacked 
decimal incrementing by a factor of 7, above and beyond
MIPS rate differences. 
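
The idea of handling the no-carry case directly and leaving the rest to a general routine can be sketched in Python. This is a simplified illustration, not the sequence of Fig. 4: it assumes an unsigned field of plain ASCII digits, tests for a carry with a comparison rather than a translation table, and uses a hypothetical helper `_propagate_carry` to stand in for the Millicode routine.

```python
def increment_display(digits):
    """Add 1 in place to an unsigned ASCII-digit field held in a bytearray."""
    if not digits:
        raise ValueError("field must contain at least one digit")
    last = len(digits) - 1
    if digits[last] != ord("9"):
        # Common case: no carry, so only the least-significant digit changes.
        digits[last] += 1
        return "fast path"
    # Rare case: hand the carry to a general routine (Millicode in the article).
    _propagate_carry(digits)
    return "carry path"


def _propagate_carry(digits):
    """Ripple a carry leftward; an overflow wraps to all zeros in this sketch."""
    i = len(digits) - 1
    while i >= 0 and digits[i] == ord("9"):
        digits[i] = ord("0")
        i -= 1
    if i >= 0:
        digits[i] += 1


field = bytearray(b"0041")
print(increment_display(field), field.decode())  # fast path 0042
```

Nine times out of ten the last digit is not 9, so the fast path touches a single byte and never calls the general routine.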

Among the special cases recognized by the first genera- 
tion of HP Precision Architecture optimizing compilers are: 

■ Unpacked decimal equality comparison. Compare the 
least-significant digits in-line. If they are not equal, the 
operation is complete. Otherwise, complete the compari- 
son in a highly specific Millicode routine. 

■ Unpacked decimal rounding. Compare the digit to be 
rounded with the constant 5 in-line. If it is less than 5, 
simply truncate. Otherwise, perform the special display 
increment on the following digit and then truncate. 

■ Unpacked decimal multiplication by power of 10. Sim- 
ply shift the number. 

■ Unpacked decimal division by power of 10. Simply shift 
the number. 

■ Unpacked decimal remainder by power of 10. Truncate 
or shift the number. 

■ Unpacked decimal multiplication by a small constant. 
Turn into a series of possibly scaled additions. 

■ Unpacked decimal decrement. Same as unpacked deci- 
mal increment. 

■ Unpacked decimal addition of constant. Prebias the con- 
stant at compile time and use a special-purpose display 
addition Millicode routine. 

■ String comparison. Compare one to four of the leading 



sboff  = Constant offset to sign digit of 4-digit unpacked decimal item
table1 = Register containing pointer to base of set of translation tables
table2 = Register to hold pointer to increment table
sign   = Register to hold original sign digit
xsign  = Register to hold incremented sign digit
sr     = Millicode space register

        LDB       sboff(dp),sign             ; Load sign digit
        LDO       inc_offset(table1),table2  ; Get pointer to increment table
        LDBX      sign(table2),xsign         ; Translate increment sign digit
        COMIB,<>  0,xsign,all_done           ; Branch over millicall if no carry
        STB       xsign,sboff(dp)            ; Store incremented sign digit
        LDO       sboff(dp),ARG0             ; Get pointer to sign digit
        BLE       $$ginc_cont(sr,0)          ; Continue increment in Millicode
        LDI       4,ARG1                     ; Pass Millicode length of 4
all_done







bytes of the two strings in-line. If they are not equal, the 
operation is complete. Otherwise, finish the operation 
using a special-purpose Millicode routine. 
■ Byte move in which target address = source address + 
1. Turn operation into fill. 
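
The last special case in the list can be checked with a few lines of Python. The sketch assumes that a byte move copies one byte at a time in ascending address order, which is what makes an overlapping move with target = source + 1 behave like a fill; the function names are illustrative only.

```python
def forward_byte_move(buf, src, dst, length):
    """Copy length bytes one at a time in ascending address order."""
    for i in range(length):
        buf[dst + i] = buf[src + i]


def fill(buf, start, length, value):
    """Store the same byte value into length consecutive positions."""
    for i in range(length):
        buf[start + i] = value


moved = bytearray(b"X.......")
filled = bytearray(moved)
forward_byte_move(moved, 0, 1, 7)  # target address = source address + 1
fill(filled, 1, 7, filled[0])      # fill with the first source byte instead
print(moved == filled)             # True
```

Each copied byte becomes the source of the next copy, so the first byte is smeared across the whole target, which is exactly what the fill does without reading memory again.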

The exploitation of special cases is really just an exten- 
sion of the overall HP Precision Architecture complex op- 
eration solution. The code sequences for these special cases 
combine the quest for performing no unnecessary work 
with the RISC philosophy of tuning for the most common 
cases. 

A legitimate question to ask, however, is "Why couldn't 
this same strategy be used by CISC systems?" The answer 
takes us back to the beginning — fast simple operations. The 
HP Precision Architecture complex operation solution 
picks up its performance improvements by using compile- 
time information to generate custom in-line code sequences 
and to select highly specific Millicode routines. Both the 
in-line sequences and the Millicode are composed of the 
fast simple operation building blocks provided by HP Pre- 
cision Architecture. If a CISC were to attempt to build the 
same mechanism out of its regular instruction building 
blocks, the combined overhead incurred would quickly 
outweigh the advantages of using compile-time informa- 
tion. 

Compiler Performance Measurements 

To compare the relationship of compilers and computer 
architectures of different machines accurately, it is first 
necessary to factor out differences in the machines' raw 
speed. To compute the following performance ratios, the 
elapsed time required to perform each benchmark on a 
particular machine was multiplied by that machine's MIPS 
rate. For example, if machine A is rated at 2 MIPS and 
executes a benchmark in 20 seconds, its adjusted time 
would be 40 seconds. If machine B, rated at 3 MIPS, per- 
forms the same benchmark in 10 seconds, its adjusted time 
would be 30 seconds. We could then say that the compiler/ 
architecture combination of machine B outperforms that 
of machine A by a factor of 1.33 (40/30), above and beyond
differences in the machines' MIPS ratios. 
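
The normalization described above is a one-line calculation. The following Python sketch reproduces the machine A and machine B example; the function names are chosen for illustration.

```python
def adjusted_time(elapsed_seconds, mips):
    """Scale elapsed time by the machine's MIPS rate to factor out raw speed."""
    return elapsed_seconds * mips


def performance_ratio(time_a, mips_a, time_b, mips_b):
    """Factor by which machine B outperforms machine A after MIPS adjustment."""
    return adjusted_time(time_a, mips_a) / adjusted_time(time_b, mips_b)


ratio = performance_ratio(20, 2, 10, 3)
print(adjusted_time(20, 2), adjusted_time(10, 3), round(ratio, 2))  # 40 30 1.33
```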

In the following table, performance ratios greater than 1 
suggest that the HP Precision Architecture/compiler re- 
lationship is more efficient than its CISC counterpart. 

Because the HP Precision Architecture code generation 
strategy calls for custom code sequences which depend on 
many factors, the performance ratios for a single type of 
complex operation can vary widely from one specific exam- 
ple to another. Each item in this table represents a class 
of operations. The compiler performance ratios were gen- 
erated by comparing an HP Precision Architecture machine 
with representative CISC machines from HP and other man- 
ufacturers. 



Fig. 4. Unpacked decimal increment code sequence. 
                                                       Compiler
                                                       Performance
                                                       Ratio

Moves/Fills
  Short move (~10 bytes)                               2.7
  Long move (~80 bytes)                                0.6
  Short fill (~10 bytes)                               1.2
  Long fill (~80 bytes)                                0.5
  Average move/fill                                    1.4

Comparisons
  Byte string comparisons                              1.1
  Unpacked decimal comparisons                         2.0
  Unpacked decimal comparisons with digit validation   1.0
  Packed decimal comparisons                           1.1
  Packed decimal comparisons with digit validation     0.7

Decimal Addition
  Unpacked decimal addition                            3.2
  Unpacked decimal addition with digit validation      1.6
  Packed decimal addition                              2.1
  Packed decimal addition with digit validation        1.3

Multiplication
  64-bit integer multiplication                        0.7
  Packed decimal multiplication                        1.5
  Packed decimal multiplication with digit validation  1.3

Subscripting
  Small subscripted move; integer subscript            2.9
  Small subscripted move; unpacked decimal subscript   3.5
  Small subscripted move; packed decimal subscript     1.0
Viewpoints



A Viewpoint on Calculus 

Presented to the Mathematics Panel of the American 
Association for the Advancement of Science on 
April 5, 1986 

by Zvonko Fazarinc 



IT IS MY BELIEF that the teaching of infinitesimal calculus is
creating serious problems for science and mathematics educa- 
tion at every level. It is also affecting our research and indus- 
trial organizations in a negative sense. For both these reasons it 
is essential that we make drastic changes in the teaching of cal- 
culus, match it to the needs of the modern world, and equip our 
future scientists with the necessary problem solving skills.*
Consider this: 

The relatively long mathematical maturation period that must 
precede the introduction of calculus pushes it into the late second- 
ary levels or even college. (More than half of the high schools in 
our nation are no longer teaching it because of the shortage of 
qualified science and mathematics teachers.) 

The physics instruction is tied very closely to the calculus and 
hence the teaching of physics is also pushed out in time. As a 
consequence, many precious early school years, characterized by 
extreme knowledge acquisition rates, are being lost for physics 
and other sciences relying on calculus. 

The ever increasing mass of new knowledge that needs to be 
taught at the higher levels calls for more introductory college ma- 
terial to be pushed back into the secondary schools. This is contrary 
to the current trend characterized by the declining ability of sci- 
ence instructors to teach abstract subjects. These include even the 
elementary physics, which is very simple in concepts but has been 
unnecessarily "abstractized" by the calculus. 

When the calculus finally gets taught, its limitations are rarely 
discussed, and when encountered, are often dismissed as special 
cases. The myth of calculus' power propagates into our research 
and industrial organizations and is responsible for considerable 
loss of time in fruitless attempts to apply it to real-life scientific
or engineering problems. This statement is based on my own ob- 
servations of above average scientists in the R&D environment. 
The fact is that we cannot solve algebraic equations of order higher 
than four. Consequently, we cannot solve differential equations 
of order higher than four either. Furthermore, we can solve only 
a handful of very special nonlinear differential equations. It is 
therefore fair to say that calculus is only applicable to linear prob- 
lems of order four or less. That leaves out the vast majority of all 
interesting problems. 

Regardless of how unpopular the view presented here may seem, 
it merits a thorough scrutiny by this panel. And here are some 
suggestions for change. 

We must recognize that we live in a computer era. Almost as a 
rule, we submit our closed-form solutions derived with or without 
the use of calculus to the computer for evaluation, plotting, or 
some other transformation. But we must also recognize that mod- 

* I wish to explain that my views have evolved through many years of elation over the
beauty of infinitesimal calculus, which I practiced with enthusiasm, to a nagging realization
that in real problem solving situations I had to resort to other, more mundane methods. It
is only in the last couple of years that I have mustered the courage to voice my opinion, in
a circle of trusted friends at first and later at small gatherings, only to encounter a religious
adherence to calculus. Nevertheless, in open discussions the initial shock usually gave
way to an agreement with the views presented herein.



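$F = F(t)$:

$m \frac{d^2 s(t)}{dt^2} = F(t), \qquad s(t) = \frac{1}{m} \iint F(t) \, dt \, dt$

This integral cannot be solved for arbitrary $F(t)$. Now add friction:

$m \frac{d^2 s(t)}{dt^2} = F - k \left( \frac{ds(t)}{dt} \right)^2$

This is a nonlinear differential equation that has no closed-form solution.

$m = \text{constant}$ is already a simplification in itself. Einstein says $m = m_0 / \sqrt{1 - v^2/c^2}$.
Then:

[equation not recoverable from the source]

Fig. 1. The present method of teaching the physics of simple
motion.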



ern computers are capable of doing a great deal more than that. 
They can accept and then apply the fundamental laws of physics 
to a variety of problems. They can make realistic predictions of 
the behavior of nonlinear, highest-order systems, and they have 
become significant problem solvers for academia and industry. 
Industry today commits hundreds of thousands of dollars to pro- 
duction of a single integrated circuit, the performance of which 
has only been predicted by a computer simulation. 

In all of these impressive successes no closed-form solutions 
are being used. We should therefore teach our future generations 
not to strive for closed-form solutions at all, but to accept the fact
that we are unable to produce them for the majority of useful 
cases. As an alternative, we would teach children at an early age 
how to pose the problem to the computer. This is not to say that 
we teach them programming. We teach them how to cast the prob- 
lem into the form understood by the computer. (Programming will 
be second-nature to future generations because of the prevalence 
of computers and advances in high-level programming languages. 
Already today there are over one million computers in public 
schools in the U.S.A.) 

To put it even more bluntly, let us not teach our youngsters 
how to derive a differential equation that we cannot solve anyway 
and the computer does not understand. Teach them instead how 
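$m = \text{constant}$:

Momentum $M = mv$

$\Delta M = m \, \Delta v = F \, \Delta t, \qquad \Delta s = v \, \Delta t$

$m \frac{\Delta v}{\Delta t} = F, \qquad \frac{\Delta s}{\Delta t} = v$

$\Delta t \to 0$:

$m \frac{dv(t)}{dt} = F, \qquad \frac{ds(t)}{dt} = v(t)$

$m \frac{d^2 s(t)}{dt^2} = F$

Fig. 2. How we got where we are today by forcing the limit.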

to express the problem in the finite difference form. This makes 
the problem ready for a direct computer solution, in most cases 
with immediate graphic feedback. At the same time, this approach
circumvents the pitfalls of complex derivations and removes the
abstraction of infinitesimal calculus from the problem.

We could start teaching physics much earlier, since only elemen- 
tary algebra is needed to support finite differences. We should be 
able to cover more science fundamentals and would provide the 
extra college time that is so badly needed to meet the job require- 
ments of the modern world. We can start teaching physics with 
scientific honesty. We no longer need to ignore friction, air resistance,
nonhomogeneity of media, and other "complications" the
infinitesimal calculus cannot deal with, complications that are
all around us and that keep reminding us that our solutions are
a mere shadow of the truth. Let us remove the crutch we use when
the observations do not match our predictions. Let us use instead
the mathematical tool that has no excuses. This can only make
our future generations more honest about their findings. 

These points are illustrated by the two examples that follow. 

Fig. 1 shows the present method of teaching the physics of 
simple motion. It points out the breakdown of infinitesimal calculus
even for the simplest case of friction. Fig. 2 shows how we



$\Delta v = \frac{F}{m} \Delta t, \qquad \Delta s = v \, \Delta t$

$v(t + \Delta t) = v(t) + \frac{F}{m} \Delta t, \qquad s(t + \Delta t) = s(t) + v(t) \, \Delta t$

In computer jargon:

$v = v + \frac{F}{m} \Delta t, \qquad s = s + v \, \Delta t$

For a relativistic mass moving in the presence of friction, define the
friction coefficient $k$ and $v_0$, $s_0$, $F_0$, and $m_0$. Then execute the following
for each time step $\Delta t$:

$v = v + \frac{F}{m} \Delta t$
$s = s + v \, \Delta t$
$F = F_0 - k v^2$
$m = m_0 / \sqrt{1 - v^2/c^2}$

Fig. 3. Going back to the original difference description
brings nonlinear cases under control and formulates the problem
for computer solution.
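
The time-stepping recipe of Fig. 3 translates almost directly into a program. The Python sketch below is one possible reading of it, with several assumptions: the net force is taken as a constant applied force minus a friction term k v^2, the statements run in the order listed (so the position update uses the already-updated velocity), and the parameter names and sample values are invented for the example.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s


def simulate_motion(f0, m0, k, v0, s0, dt, steps):
    """Advance velocity v and position s by repeated finite-difference steps.

    Each step applies v += (F/m)*dt and s += v*dt, then recomputes the
    net force F = f0 - k*v**2 and the relativistic mass m.
    """
    v, s = v0, s0
    force = f0 - k * v * v
    mass = m0 / math.sqrt(1.0 - (v / C_LIGHT) ** 2)
    for _ in range(steps):
        v = v + (force / mass) * dt
        s = s + v * dt
        force = f0 - k * v * v
        mass = m0 / math.sqrt(1.0 - (v / C_LIGHT) ** 2)
    return v, s


# Sample values: 10 N applied force, 1 kg rest mass, k = 0.1, 50 s simulated.
v_final, s_final = simulate_motion(10.0, 1.0, 0.1, 0.0, 0.0, 0.01, 5000)
print(v_final, s_final)
```

With these sample values the velocity settles near sqrt(f0 / k) = 10 m/s, the speed at which friction balances the applied force, even though the underlying differential equation is nonlinear.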



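$\frac{\partial C(x,t)}{\partial t} = D \frac{\partial^2 C(x,t)}{\partial x^2}, \qquad C(x,0) = f(x), \qquad C(0,t) = 0$

$C(x,t) =$ [integral solution not recoverable from the source]

This integral cannot be solved for arbitrary $f(x)$.

$D = \text{constant}$ is already a simplification. In reality $D = D(x)$. Then:

$\frac{\partial C(x,t)}{\partial t} = \frac{dD(x)}{dx} \frac{\partial C(x,t)}{\partial x} + D(x) \frac{\partial^2 C(x,t)}{\partial x^2}$

Fig. 4. The present method of formulating diffusion problems.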

got to where we are today by forcing the limit on the finite differ- 
ence formulation. Fig. 3 illustrates how by going to the original 
difference description we regain control over nonlinear cases and 
obtain at the same time the computer-understood formulation of 
the problem. Note how the implicitness of time in the computer
jargon simplifies the notation. 

Fig. 4 illustrates the same point for the case of diffusion, which
requires a partial differential description. Nonhomogeneous media
are excluded from the solution domain because of the limitations
of infinitesimal calculus. Fig. 5 shows how we have abandoned
Einstein's elegant finite difference formulation of the problem
for the partial differential equation, which we can solve only for
a few trivial cases. Finally, Fig. 6 shows how by going back to the
original difference description we are able to overcome the limitations
and avail ourselves of the formulation that is easily understood
by the computer and by the human. Noteworthy is the reduction
of the degree of the formula because of the implicitness of time.

It goes without saying that the finite difference calculus has its 
own problems, also. The numerical instabilities of discrete inte- 
gration formulas, the errors arising from finite sampling intervals, 
and time-consuming computations in the presence of widely separated
eigenvalues are just some of them. There is no doubt that



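$C(x - \Delta x, t) \qquad C(x, t) \qquad C(x + \Delta x, t)$

Equal chance $p$ that a fraction of $C$ moves left or right:

$C(x, t + \Delta t) = p \, C(x - \Delta x, t) + p \, C(x + \Delta x, t) + (1 - 2p) \, C(x, t)$

$\frac{C(x, t + \Delta t) - C(x, t)}{\Delta t} = p \, \frac{\Delta x^2}{\Delta t} \cdot \frac{C(x - \Delta x, t) + C(x + \Delta x, t) - 2 C(x, t)}{\Delta x^2}$

$\Delta t \to 0, \quad \Delta x \to 0$:

$\frac{\partial C(x,t)}{\partial t} = \lim \left( p \, \frac{\Delta x^2}{\Delta t} \right) \frac{\partial^2 C(x,t)}{\partial x^2}$

Fig. 5. How we abandoned Einstein's difference formulation
of the diffusion problem by forcing limits.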
$C(x, t + \Delta t) = C(x, t) + p \, [C(x - \Delta x, t) + C(x + \Delta x, t) - 2 C(x, t)]$

In computer jargon:

C(x) = C(x) + p(C(x-1) + C(x+1) - 2C(x))

For variable diffusivity and arbitrary initial condition $C(x, 0)$ define
$p(x) = D(x - \Delta x / 2) \, \Delta t / \Delta x^2$ for all values of x between 0 and N. Then
execute the following for each time step $\Delta t$:

for x = 0 to N
    C(x) = C(x) + p(x)[C(x-1) - C(x)] + p(x+1)[C(x+1) - C(x)]

Fig. 6. Going back to the original difference description overcomes
the limitations.
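
The update loop of Fig. 6 can be made runnable with a few decisions the figure leaves open. In the Python sketch below, the two end points are held fixed and every new value is computed from the previous time step's values by working on a copy; both choices are assumptions of this sketch, as are the function name and the sample data.

```python
def diffusion_step(c, p):
    """One explicit time step of the variable-diffusivity difference update.

    c[x] is the concentration at grid point x. p[x] is the fraction
    exchanged between points x-1 and x, so p needs len(c) + 1 entries.
    The end points are left unchanged (an assumption of this sketch).
    """
    new_c = list(c)
    for x in range(1, len(c) - 1):
        new_c[x] = (c[x]
                    + p[x] * (c[x - 1] - c[x])
                    + p[x + 1] * (c[x + 1] - c[x]))
    return new_c


concentration = [0.0, 0.0, 1.0, 0.0, 0.0]
exchange = [0.25] * 6
print(diffusion_step(concentration, exchange))  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

A nonhomogeneous medium is handled simply by giving `p` different values at different grid points; no change to the loop is needed.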

good research is needed to scrutinize the finite difference calculus, 
in particular the partial domain. We should never find ourselves 



glossing over the limitations of discrete mathematics but should
courageously point them out. Despite these problems, the finite 
difference approach to problem solving opens the way to the major- 
ity of cases that the infinitesimal calculus leaves stranded. 
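
One of the numerical instabilities mentioned above is easy to reproduce. The Python sketch below runs the constant-p diffusion update on a single spike with fixed end points. The bound p <= 1/2 used in the comments is the standard stability limit for this explicit scheme and is not stated in the article; the grid size and step count are arbitrary choices.

```python
def largest_value_after(p, steps, n=21):
    """Run the constant-p diffusion update on a unit spike; return max |C|."""
    c = [0.0] * n
    c[n // 2] = 1.0
    for _ in range(steps):
        prev = c
        c = list(prev)
        for x in range(1, n - 1):
            c[x] = prev[x] + p * (prev[x - 1] + prev[x + 1] - 2.0 * prev[x])
    return max(abs(value) for value in c)


stable = largest_value_after(0.4, 100)    # p below 1/2: the spike spreads and decays
unstable = largest_value_after(0.6, 100)  # p above 1/2: oscillations grow without bound
print(stable, unstable)
```

Choosing the time step so that p stays at or below 1/2 keeps the computation well behaved, at the cost of more steps when the grid is fine or the diffusivity is large.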

What about the future of infinitesimal calculus? I believe that
it will remain a research tool for the limited set of problems it can 
handle. It should be taught as a special case of finite difference 
calculus but much less time should be devoted to it. It does not 
have to be addressed until the college level, when the readiness 
for abstraction has been developed. This will remove the burden 
from secondary-level teachers who, like students, will have a much
easier time dealing with finite differences. They will also be able 
to teach physics on more concrete grounds and will likely improve 
the quality of science instruction all around. The removal of the 
abstractions introduced by the infinitesimal calculus could signif- 
icantly compensate for the declining number of qualified science 
instructors. The substitution of discrete calculus for the infinites- 
imal would also beneficially affect the students' dropout rate, com- 
puter literacy, and problem solving skills in general. 



Hewlett-Packard Company, 3200 Hillview
Avenue, Palo Alto, California 94304






Technical Information from the Laboratories of
Hewlett-Packard Company




